ACOUSTIC CORRELATES OF VOICE QUALITY AND
By
Chairman: D.G. Childers

Major Department: Electrical Engineering

The main purpose of this research is to find acoustic correlates of
voice quality. Two groups of voices are considered: normal and deviant
voices. Among deviant voices, five voice qualities are studied: overall
severity, hoarseness, breathiness, roughness, and vocal fry.

Twelve acoustic parameters are implemented. They represent a
combination of spectral and time-domain measures, most of which are
extracted from the residue signal obtained by inverse filtering the
speech wave using the LPC technique. The voice data base consists of 52
normal and 30 pathologic voices. All the subjects are asked to phonate
the vowel /i/ for about 2 s.

The optimal methods of analysis are researched, and the validity of
the pitch detection scheme is tested. A formant synthesizer is used to
study the relations between acoustic parameters.

After selecting the measures that show a significant difference
between high-quality and very deviant voices, we correlate them to the
scores of a subjective listening test. The results of this correlation
analysis show that it is difficult to predict the quality of normal
voices, since no agreement can be reached among judges. On the
contrary, for pathologic voices, the five voice qualities described
above can be quite reliably predicted by using certain acoustic
measures, which shows that source-related factors are more important
than spectral characteristics for the study of voice quality.

[...] reliability of several spectral distortion measures, and implement a new
Euclidian distance measure, based on the acoustic measures mentioned
[...] synthetic voices are used.

[...] certain voice qualities by a certain set of acoustic parameters, and
that we can benefit from this knowledge to design new distortion
measures used in speech processing systems.


[...] of measures) is essential to the development of speech processing
systems, including the following:

- speech coding, e.g. human-to-human voice communication as with the [...]

- speech synthesis, e.g. machine-to-human communication;

- speech recognition, e.g. human-to-machine communication.


[...] perception that is able to model the ability of a human listener in
terms of measurable spectral or waveform parameters. Various attempts
have been made to overcome this problem, but none of [...]
totally satisfactory (Klatt, 1986).

In the field of speech pathology, an objective [...]
quality might assist the pathologist in his attempt to [...]
physiology, and pathology. Numerous studies have [...]
varies greatly, depending on the type of pathology, the [...]
speaker gender, and the conditions of analysis (Kasuya et al., 1986a and
1986b; Muta et al., 1987).

Unfortunately, given the present state of speech science, it is
unlikely that one single quantitative measure of speech quality will be
as effective as a skilled human listener. However, a combination of
measures might give satisfactory results for a given laryngeal quality
(Sansone and Emanuel, 1970; Yumoto et al., 1982). One of the goals of [...]

[...] breathiness, roughness, hoarseness, and [...]

A breathy voice is indicative of some air escapage through a
non-closed glottis. There seems to be an inverse relationship between the
degree of severity of breathiness and the length of the closed glottal
phase during vocal fold vibration (Ladefoged, 1973).

A rough voice is a function of aperiodicity of vocal fold vibration
and is typically associated with low fundamental frequency of voicing,
[...] escapage of air and an aperiodicity of vocal fold vibration. Vocal fry
is the sound detected in harsh voices, and is also associated with a
very low vocal pitch.

[...] behavior; consequently measures related to such behavior should
correlate well to these various voice qualities. Furthermore, various
experiments using synthetic speech (Childers and Wu, 1988; Pinto, 1987)
have shown that these qualities could be obtained by varying the source
excitation and keeping the vocal tract parameters constant.

The problem, then, is to select an appropriate set of acoustic
measures. Several measures are referenced in the literature (Koike et
al., 1977; Davis, 1976; Yumoto et al., 1982; Kasuya et al., 1986c), and
many more are still in a development stage. Two things guided us in
selecting appropriate measures:

- mixing spectral and time-domain measures, since these two categories
perform well under the appropriate conditions;

[...] are studied.


[...] Davis (1976) and performed [...]
spectral and time-domain acoustic measures. These measures had been [...]
base. At first, we were only interested in testing the performance of
these measures in discriminating between normal high quality voices and
very deviant voices; we adopted the six best-performing measures for the
rest of the study. These measures were then correlated to the results
of a formal listening test, for each of the following categories:
overall severity, hoarseness, breathiness, roughness and vocal fry, for
pathologic voices, and overall excellence for normal voices (for this
latter category, male and female voices were treated separately).

Using the formant synthesizer, several relations were investigated
between the various measures considered, and our results were validated
[...] then be designed.

This study should be of interest to both the engineer and the
speech scientist, both of whom have an interest in developing measures
that could discriminate between good and bad quality voices. A review
of past studies is given below.

1.2  Speech  Production  Model 

Figures 1.1 and 1.2 present the human vocal system and a
corresponding schematic diagram, respectively. As noted in Childers
(1977), the laryngeal function is to provide a protective closure for
the respiratory system. The lungs and associated respiratory muscles
are the vocal power supply. For voiced sounds (such as vowels), the air
pressure increases until the folds are pushed apart, forming a slit known
as the glottis; the expelled air causes the vocal cords to vibrate as a
relaxation oscillator. This vibratory motion is the result of the
interaction of two forces: one is the subglottal pressure, which causes
air to push the adducted folds apart, releasing a puff of air; the other
force is known as the Bernoulli effect (along the flow, the sum of the
pressure and a term proportional to the velocity squared is constant),
which is a suction phenomenon that pulls
the cords together when the air velocity through the glottis is
relatively large, and causes the glottis to close when it is initially [...]


The succession of air pulses generated by the vibratory motion of [...]
generated has an auditory correlate (pitch) directly related to the
frequency of oscillation of the vocal folds, and a loudness which is
determined by the amplitude of the acoustic pressure wave. The
frequency of oscillation of the vocal folds is determined by their mass,
thickness, elasticity, and compliance, as well as by the subglottal
pressure.


The volume-velocity (i.e. the rate of air flow), after passing
through the pharynx, mouth, and nasal cavities, is finally radiated from
the lips and nostrils as voiced sounds. This volume-velocity is
extremely difficult to measure, and is still the object of various
studies (Javkin et al., 1987).


Fricative or unvoiced sounds are generated by forming a
constriction at some point in the vocal tract and forcing air through
the constriction at high enough velocity to produce turbulence; this
creates a broad-spectrum noise source to excite the vocal tract.
Plosive sounds result from making a complete closure, building up
pressure behind the closure, and abruptly releasing it.


The vocal tract and nasal tract can be seen as tubes of nonuniform
cross-sectional area. As the air flow propagates [...]
frequency spectrum is shaped by the frequency selectivity of [...]
(Rabiner and Schafer, 1978). The resonance frequencies of [...]
and dimensions of the vocal tract. Different sounds are formed by
varying the shape of the vocal tract.

The interaction between the source and the vocal tract has been
ignored in most speech models, but needs to be accounted for if a high
degree of naturalness is to be achieved with speech synthesizers
(Rothenberg, 1981; Childers and Wu, 1988). As noted by Rothenberg
(1981), "it is generally realized that there can be appreciable first
formant energy absorbed by the glottis during the open phase of the [...]
glottal flow (volume velocity) waveform and a change in the frequency [...]
formant bandwidths. On the other hand, these variations of the vocal
tract filter seem to alter the volume velocity, making it unsymmetric
and causing some glottal ripple (i.e. spurious peaks in the glottal
waveform).


Speech Perception Model


Speech quality is coupled to the human speech perception process.
The design of distance measures should be based on speech perception. In [...]
distances and the listeners' subjective impressions.

[...] rather firm basis, thanks primarily to the experiments carried out by G.
von Bekesy, who received the Nobel Prize in 1961. In contrast, as noted [...]
activity. Still [...] information to the ultimate mechanism of perception" (p. 87).

[...] still possible to quantify certain aspects of perception, by observing
and measuring the subjective behavior of listeners in response to
prescribed auditory stimuli. Many such studies have [...]
have proven quite successful in emphasizing various aspects of the human
perception process. But the conclusion of most research is that the [...]
perception of phonetic distances and that higher level processes must be [...]
Nielsen, 1988) might answer this need.

The concept that the ear can be compared with a frequency analyzer
has been accepted for a long time. When designing the auditory filter, [...]
Such a factor is, for example, the auditory masking (Zwislocki, 1978),
which may be defined as decreased audibility of one sound due to the
presence of another sound. Partial masking (Zwislocki, 1978) is used
for situations where the masked sound is clearly audible, but its
loudness is decreased. Forward masking is applied to stimuli following
the masking sound, whereas backward masking refers to stimuli preceding
it. Another important constituent of the ear's characteristics is
called "lateral suppression" (Plomp, 1976) and can be defined as the
decrease in loudness with respect to the sound-pressure level. Finally, [...]
sufficient for generating perceptually identical signals explains the [...]
production (Schroeder and Atal, 1965).

Some researchers are now trying to take these factors into account
when trying to code or synthesize the speech signal (Schroeder and Atal, [...]).

From his research, Klatt (1982) concludes that only formant
frequency changes contribute significantly to phonetic change whereas
changes in spectral tilt, formant amplitude, low-pass, and notch
filtering are less important. We presently lack the knowledge necessary
to include the information in our design of quantitative quality [...]


Perturbation (Jitter)

[...] studies concerning the vibratory pattern of the vocal [...]
the laryngeal vibrations (Timcke et [...]

Although the pathologic larynx had already been the [...]
(Timcke et al., [...]), the first significant study presenting
objective data for the discrimination between
normal and pathologic voices was presented by Lieberman (1961, 1963).
He examined the variations in the fundamental periodicity of the
acoustic waveform (generically called "jitter"), and found a change in
period over three successive cycles in the speech of normal speakers 66%
of the time. He also introduced the concept of Pitch Perturbation
Factor, which is defined as the percentage of pitch period perturbations
(i.e. the time difference between successive cycles) that exceed 0.3 ms.
This factor proved to be sensitive to certain types of pathologies.
However, one of the weaknesses of this study was the low time resolution
of the measurements, with a sampling period equal to 0.2 ms. The sampling
frequency needed to study jitter is still an open problem: Titze et al.
(1987) suggested that at
rather low sampling frequencies, the use of interpolating techniques
might be necessary. In a separate study, Milenkovic (1987) suggested
that, on the contrary, low sampling frequencies yield better results [...]
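
As an illustration of the Pitch Perturbation Factor described above, the following is a minimal sketch in Python. It assumes that the pitch periods have already been extracted and are given in milliseconds; the function names and the example period values are hypothetical and are not taken from Lieberman's study. The perturbation is taken here as the absolute difference between successive periods, and the 0.3 ms threshold is the one quoted in the text.

```python
def pitch_perturbations(periods_ms):
    """Absolute differences between successive pitch periods (in ms)."""
    return [abs(b - a) for a, b in zip(periods_ms, periods_ms[1:])]


def pitch_perturbation_factor(periods_ms, threshold_ms=0.3):
    """Percentage of perturbations that strictly exceed threshold_ms."""
    perturbations = pitch_perturbations(periods_ms)
    if not perturbations:
        raise ValueError("need at least two pitch periods")
    count = sum(1 for d in perturbations if d > threshold_ms)
    return 100.0 * count / len(perturbations)


# Hypothetical pitch periods in ms (illustrative values, not measured data).
steady = [8.0, 8.1, 8.0, 8.1, 8.0]
irregular = [8.0, 8.6, 8.0, 8.1, 7.5]
```

With a sampling period of 0.2 ms, as in the original study, period differences can only be resolved in steps of 0.2 ms, which is one way to see why the time resolution matters for a 0.3 ms threshold.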

Since the study by Lieberman (1961), numerous definitions of jitter
have appeared in the literature, most of which will be detailed in
Chapter 2, and many studies have attempted to assess the reliability of [...]
1966; Hollien et al., 1973; Horii, 1979;
Smith et al., 1978; Sorensen and Horii, 1984; Milenkovic, 1987; Klingholz
and Martin, 1985). Unfortunately, there is no unanimous agreement about
the efficiency of jitter in discriminating between normal and pathologic
voices. One fact, though, appears now to be clear: in order for [...]
jitter (Childers and Wu, 1988).


[...] variations of the fundamental [...]
widely investigated (Horii, 1980; Kasuya et [...]
Klingholz and Martin, 1983; Ludlow [...]
the results have generally proven less successful [...]
though pathologic voices tend to exhibit higher [...]
normal voices.


[...] the signal-to-noise ratio concept, such as the Harmonics-to-Noise Ratio
or the Normalized Noise Energy (Kitajima, 1981; Kasuya et al., 1986;
Sansone and Emanuel, 1970; Yumoto et al., 1982). In addition, spectral [...]
Handersson, 1987) have been developed.

Since the study by Lieberman, various studies have been concerned [...]
Markel, 1975; Hiki et al., 1976; Childers, 1977; Koike et al., 1977;
Davis, 1976; Hammarberg et al., 1980; [...]
1984; Kasuya et al., 1986a and 1986b). [...]
promising results in discriminating between normal voices and a certain
type of pathology, and most of them have provided new tools to the existing
analysis techniques. However, none of them has exhibited a totally [...]
contradictory results [...]


1.4.4 Distance Measures

Going beyond the scope of pathology detection, the engineer is [...]
earlier, this represents a major problem in the field of speech
synthesis; not only would objective measures help evaluate the
naturalness of a synthesizer or a vocoder, but the concepts behind these
measures would also help us understand the speech production process,
and hence develop systems that would more closely match human speech [...]
quality of a speech system (Barnwell and Bush, 1978; Barnwell, 1979;
Barnwell and Quackenbush, 1982); the basic system is described in Figure
1.4. During this time, several distortion measures have been
implemented (Gray et al., 1980), most of them spectral distortion
measures, the most famous one certainly being the Itakura-Saito
distortion measure.

One of the most comprehensive studies to date, the study by Barnwell [...]
(1985), correlates [...] results of [...] listening [...]
Diagnostic Acceptability Measure (Voiers, 1977; Quackenbush, 1986) [...] to
the results of different measures applied to various types of coding
distortions; however, even though some of the distortions caused by
specific coding techniques were successfully detected by certain
measures, others were not, and no general pattern or any new knowledge
of the human speech production model could be gained. The most
efficient distance measure proved to be the Klatt distance measure
(Klatt, 1982).


In the field of speech recognition, two techniques emerge as the
most important: Dynamic Time Warping (DTW) and Hidden Markov Models
(HMM) (Juang, 1984a; Levinson, 1985). DTW is a non-parametric technique [...]
signal (or perspective template) to determine which one corresponds to
the reference signal, whereas HMM is a parametric technique that uses a
statistical approach (Levinson et al., 1981; Rabiner and Juang, 1986)
and is a relatively new technique still in development. The need for
new distances to be used in the Dynamic Time Warping approach is still
critical (Nocerino et al., 1985a and 1985b). No distance measures based [...]
described in the literature.
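
To make the role of the distance measure in Dynamic Time Warping concrete, here is a minimal Python sketch of the classical dynamic-programming recursion. It is an assumption-laden illustration, not the algorithm of any of the cited studies: the sequences are scalar features, the default local distance is an absolute difference, and the names `dtw_distance` and `recognize` are hypothetical. The `local_distance` argument is the place where a new distance measure would be plugged in.

```python
def dtw_distance(test, reference, local_distance=lambda a, b: abs(a - b)):
    """Cumulative cost of the best monotonic alignment of two sequences."""
    n, m = len(test), len(reference)
    if n == 0 or m == 0:
        raise ValueError("sequences must be non-empty")
    inf = float("inf")
    # cost[i][j]: best cumulative cost aligning test[:i] with reference[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = local_distance(test[i - 1], reference[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # test frame repeated
                                 cost[i][j - 1],      # reference frame repeated
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def recognize(test, templates):
    """Label of the template (dict label -> sequence) closest to test."""
    return min(templates, key=lambda label: dtw_distance(test, templates[label]))
```

In a real recognizer each frame would be a vector of spectral parameters rather than a scalar, and the local distance would be a spectral distortion measure.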


1.4.5  New  Analysis  Techniques 

Finally, the growing interest among the speech science community
for the Electroglottograph or EGG (Childers et al., 1983) has led to the
development of new analysis techniques (Childers and Larar, 1984; Smith
and Childers, 1983; Krishnamurthy and Childers, 1986) that help improve [...]
lateral contact between the vocal [...]
detection algorithms, for example.

Figure 1.4: Figure of [...] for an objective quality [...]


1.5 Description of Chapters

The present study departs from those conducted in the past. We
intend to design Euclidian distance measures based on both spectral and
time-domain parameters, and to check their efficiency using the results [...]
various hypotheses. Our goal is not to find measures that discriminate
between normal and pathologic voices, but rather to contribute to the [...]
of view.

Chapter 2 gives an in-depth description of the various acoustic
measures used, as well as their interdependencies. Chapter 3 presents
the data base (normal and pathologic voices), and the efficiency of the
various acoustic measures in discriminating between normal and
pathologic voices; the conditions of analysis are also carefully
examined and compared to those used in previous studies. In Chapter 4,
we describe the listening test used, and discuss the results of a
multiple regression between the ratings of the listening test and the
data obtained from the most efficient acoustic measures. This is done
for both normal and pathologic voices. In Chapter 5, several distance
measures are described, and a new Euclidian distance measure is
designed. These measures are tested on both normal and synthetic [...]
of these tests are discussed. Finally, Chapter [...]
results and the possible [...]


[...] Linear Predictive Coding [...]
Then, various acoustic measures [...]

- time-domain measures [...]

[...] speech, which will be used [...] in the literature [...]
formant synthesizer, [...]


2.1.1 Modeling of the Human Speech Production System

The method of linear predictive analysis has become the predominant
technique for estimating the basic speech parameters, e.g., pitch,
formants, spectra, and vocal tract area functions. This model, as shown
in Figure 2.1, is composed of three acoustic filters. The primary
assumptions of the model are that it is an all-pole model and that it is
assumed to be linear and time invariant. The latter assumption is a
reasonably good approximation over a range of 10 to 20 ms. It is [...]
separable [...]
uncoupled, which is also one of the drawbacks of this model. The source [...]


Excitation Model. For voiced sounds, the excitation is modeled as
a series of unit impulses spaced at the pitch period; its [...]

$$ u(n) = U_0 \sum_{j=0}^{\infty} \delta(n - jP) \tag{2.1} $$

where $U_0$ is a scale factor and $\delta(n)$ is the Kronecker delta function ($\delta(n) = 1$
if $n = 0$ and $\delta(n) = 0$ if $n \neq 0$). The z-transform of this function is given by

$$ U(z) = \frac{U_0}{1 - z^{-P}} \tag{2.2} $$

where $P$ is the pitch period (in samples).


Glottal Shaping Filter. This filter has been represented by an
all-pole model (Flanagan, 1972), and one of its simplest representations
is given by (Davis, 1976)

$$ gs(n) = S_0 (n + 1) e^{-cnT}, \quad n = 0, 1, \ldots \tag{2.3} $$

where $S_0$ is a scale factor and $e^{-cT}$ is the location of a pole in the
unit circle in the z-plane. The z-transform is given by

$$ GS(z) = \frac{S_0}{(1 - e^{-cT} z^{-1})^2} \tag{2.4} $$
Figure 2.1: Model for speech production

Figure 2.2: Linear prediction


[...] filter, the output is

$$ g(n) = U_0 S_0 \sum_{j=0}^{\lfloor n/P \rfloor} (n + 1 - jP) e^{-c(n - jP)T} \tag{2.5} $$
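
The source part of the model can be checked numerically. The sketch below, in Python, builds the impulse train of Eq. (2.1) and a glottal shaping response of the form $S_0 (n+1) e^{-cnT}$, convolves them, and compares the result with the closed form of Eq. (2.5). All function names and the numerical values of the pitch period and of the product $cT$ are hypothetical, chosen only for illustration.

```python
import math


def impulse_train(num_samples, period, scale=1.0):
    """u(n) = scale * sum_j delta(n - j*period), as in Eq. (2.1)."""
    return [scale if n % period == 0 else 0.0 for n in range(num_samples)]


def glottal_impulse_response(num_samples, c_times_t, scale=1.0):
    """gs(n) = scale * (n + 1) * exp(-c*n*T), as in Eq. (2.3)."""
    return [scale * (n + 1) * math.exp(-c_times_t * n) for n in range(num_samples)]


def convolve(x, h):
    """Causal convolution of x with h, truncated to len(x) samples."""
    return [
        sum(h[k] * x[n - k] for k in range(min(n, len(h) - 1) + 1))
        for n in range(len(x))
    ]


def glottal_output_closed_form(n, period, c_times_t, u0=1.0, s0=1.0):
    """g(n) from Eq. (2.5): one shifted glottal pulse per past impulse."""
    return u0 * s0 * sum(
        (n + 1 - j * period) * math.exp(-c_times_t * (n - j * period))
        for j in range(n // period + 1)
    )
```

The closed form only sums over the impulses that have already occurred at time n, which is why the upper limit of the sum is the integer part of n/P.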

[...] number of narrow bandwidth complex poles; its expression is given by

$$ VT(z) = \prod_{i=1}^{K} \frac{1}{(1 - z_i z^{-1})(1 - z_i^* z^{-1})} \tag{2.6} $$

where a spectral resonance of center frequency $f_i$ and bandwidth $b_i$ is
defined by each complex pole pair $(z_i, z_i^*)$, where $z_i = \exp(-\pi b_i T - j 2\pi f_i T)$,
$z_i^*$ is the complex conjugate of $z_i$, and $T$ is the sampling period (Davis,
1976). These resonances are the formants of the vocal tract. The
output of this filter, with input $g(n)$, is the lip volume velocity $v(n)$.
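
As a small illustration of how a formant maps to a pole pair, the Python sketch below expands one factor $(1 - z_i z^{-1})(1 - z_i^* z^{-1})$ into real coefficients and cascades several such factors into a single denominator polynomial. The formant frequencies, bandwidths, and the 10 kHz sampling rate used in the usage note are hypothetical values, not parameters from this study.

```python
import math


def formant_denominator(center_hz, bandwidth_hz, sampling_period_s):
    """Coefficients [1, c1, c2] of (1 - z_i z^-1)(1 - z_i* z^-1).

    The pole radius is exp(-pi*b*T) and the pole angle is 2*pi*f*T.
    """
    radius = math.exp(-math.pi * bandwidth_hz * sampling_period_s)
    angle = 2.0 * math.pi * center_hz * sampling_period_s
    return [1.0, -2.0 * radius * math.cos(angle), radius * radius]


def cascade(polynomials):
    """Multiply polynomials in z^-1 (coefficient lists, lowest power first)."""
    product = [1.0]
    for poly in polynomials:
        result = [0.0] * (len(product) + len(poly) - 1)
        for i, a in enumerate(product):
            for j, b in enumerate(poly):
                result[i + j] += a * b
        product = result
    return product
```

For example, `cascade([formant_denominator(500.0, 50.0, 1e-4), formant_denominator(1500.0, 80.0, 1e-4)])` gives the five coefficients of a two-formant denominator; a narrower bandwidth moves the pole pair closer to the unit circle.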


Lip Radiation Model. In order to obtain a model of pressure at the
lips, the effects of radiation must be included. A reasonable
approximation to the radiation effects is given by

$$ R(z) = P_0 (1 - u z^{-1}) \tag{2.7} $$

[...] pole ranging from .98 to .99 in the z-plane.


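Complete Model. By cascading the previous filters, we obtain the
following model for the acoustic sound pressure transform $S(z)$:

$$ S(z) = R(z)\, VT(z)\, GS(z)\, U(z) \tag{2.8} $$

$$ S(z) = \frac{P_0 S_0 U_0 (1 - u z^{-1})}{(1 - z^{-P})(1 - e^{-cT} z^{-1})^2 \prod_{i=1}^{K} \left(1 - 2 e^{-\pi b_i T} \cos(2\pi f_i T) z^{-1} + e^{-2\pi b_i T} z^{-2}\right)} \tag{2.9} $$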


In fact, this model can be greatly simplified by taking into
account a few considerations. The numerator factor $(1 - u z^{-1})$ cancels
approximately one of the factors $(1 - e^{-cT} z^{-1})^{-1}$, since $cT$ is
generally much less than unity (Davis, 1976); then, the remaining
factors (i.e. $(1 - e^{-cT} z^{-1})^{-1} (1 - z^{-P})^{-1}$) can be included in the
denominator. This yields the following transfer function:

$$ H(z) = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}} \tag{2.10} $$

where $a_k$ is real, $p = 2K + 1$, $10 \le p \le 16$, and $G$ is called the gain. This gain
matches the energy or average value of the input spectrum to the model
spectrum. Except where noted, we will set the gain equal to unity.

Thus, the speech sound is modeled as the output of an all-pole
filter, which agrees with our knowledge of speech perception, i.e., "the
human auditory system is considerably more sensitive to the location of
a pole than to the location of a zero" (Childers, 1977, p. 396).

The whole problem of linear prediction is then to find the filter
coefficients $a_k$, $1 \le k \le p$, and the filter order, $p$, according to a
prescribed error criterion.

2.1.2 Solution to the Linear Prediction Problem

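Figure 2.2 presents the model of linear prediction. We know that
$S(z) = H(z) U(z)$, or

$$ S(z) = \frac{U(z)}{1 - \sum_{k=1}^{p} a_k z^{-k}} \quad \text{if } G = 1 \tag{2.11} $$

[...]

$$ u_n = s_n - \sum_{k=1}^{p} a_k s_{n-k} = s_n - \hat{s}_n \tag{2.12} $$

To find $a_k$, $1 \le k \le p$, we minimize the total-squared error $E$, where

$$ E = \sum_n u_n^2 = \sum_n \Big( s_n - \sum_{k=1}^{p} a_k s_{n-k} \Big)^2 \tag{2.13} $$

In order to minimize $E$, we let

$$ \frac{\partial E}{\partial a_i} = 0, \quad 1 \le i \le p \tag{2.14} $$

Substituting (2.13) into (2.14), we get

$$ \sum_{k=1}^{p} a_k \sum_n s_{n-k} s_{n-i} = \sum_n s_n s_{n-i}, \quad 1 \le i \le p \tag{2.15} $$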
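If the range of summation in Eq. (2.15) is taken over a finite
interval, $0 \le n \le N-1$ ($N$ is some integer), but $p+N$ consecutive signal
samples are available, the method is called the covariance method. Eq.
(2.15) [...]

$$ \sum_{k=1}^{p} a_k c_{ik} = c_{i0}, \quad 1 \le i \le p $$

where, from Eq. (2.15), $c_{ik} = \sum_{n=0}^{N-1} s_{n-i} s_{n-k}$.

[...] method, $s_n$ is identically zero outside the interval [...]. We
define $s'_n = s_n w_n$, where $w_n$ is a finite length window equal to zero
outside the interval $0 \le n \le N-1$.

Using these conditions in Eq. (2.15), we get

$$ \sum_{k=1}^{p} a_k R(|i-k|) = R(i), \quad 1 \le i \le p $$

where $R(i) = \sum_{n=0}^{N-1-i} s'_n s'_{n+i}$.

[...] autocorrelation method is stationary (Childers, 1977).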
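
The autocorrelation normal equations have a Toeplitz structure, which is commonly exploited with the Levinson-Durbin recursion. The Python sketch below is one such solver, written under the sign convention $\hat{s}_n = \sum_k a_k s_{n-k}$ used in Eq. (2.12). It is an illustration only: the text does not state which solver was used in this study, the frame is assumed to be already windowed, and the function names are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """R(i) = sum_n s_n * s_{n+i} for a frame taken as zero outside its support."""
    n = len(frame)
    return [
        sum(frame[k] * frame[k + lag] for k in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Solve sum_k a_k R(|i-k|) = R(i), 1 <= i <= order.

    Returns (a, error), where a[k-1] holds a_k and error is the
    remaining prediction error energy.
    """
    if r[0] <= 0.0:
        raise ValueError("frame has no energy")
    a = []
    error = r[0]
    for i in range(1, order + 1):
        # Correlation not yet explained by the order-(i-1) predictor.
        acc = r[i] - sum(a[k] * r[i - 1 - k] for k in range(i - 1))
        reflection = acc / error
        # Update the lower-order coefficients, then append the new one.
        a = [a[k] - reflection * a[i - 2 - k] for k in range(i - 1)] + [reflection]
        error *= 1.0 - reflection * reflection
    return a, error
```

The prediction error energy returned alongside the coefficients decreases (or stays constant) as the order grows, which gives one practical handle on the choice of the filter order p.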


2.1.3 Inverse Filtering

Two methods of inverse filtering can be used to obtain acoustic
information about the glottal source. The first method is called
"residue inverse filtering," which estimates the excitation signal u(n)
by inverse filtering the speech signal (Figures 2.3 and 2.4 present an
example of residue signal for a normal and a pathologic speaker). This
method is based on the model presented above. The second method is
called "glottal inverse filtering," where the inverse of the lip
radiation [...] spectral contributions [...]
glottal volume velocity g(n); it is also based on the linear model
presented above.

The latter method is a difficult one to implement, and no
satisfying automatic implementation is known to date. However, some
studies have been made (Javkin et al., 1987) and various parameters of
the glottal volume velocity are being investigated for early detection
of laryngeal pathology.

On the other hand, the residue inverse filtering is relatively easy
to implement. As shown in the previous section, the impulse train (or
error function) at the input of the glottal source filter is given by
the formula

$$ u_n = s_n - \sum_{k=1}^{p} a_k s_{n-k} $$

which means that $u_n$ has been deconvolved from the speech signal. Then,
the problem is, once again, to find $a_k$, $1 \le k \le p$, and $p$. In this study, we
use the autocorrelation method to solve this problem.
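
Once the predictor coefficients are known, residue inverse filtering is a direct application of the error formula. The Python sketch below assumes the coefficients are given (for instance from an autocorrelation-method solver) and that the signal is zero before the first sample. The helper `all_pole_filter` and the coefficient values in the usage note are hypothetical and serve only to build a test signal whose excitation is known.

```python
def inverse_filter(speech, a):
    """Residue u_n = s_n - sum_k a_k s_{n-k}, with s_n = 0 for n < 0."""
    p = len(a)
    return [
        s_n - sum(a[k - 1] * speech[n - k] for k in range(1, min(p, n) + 1))
        for n, s_n in enumerate(speech)
    ]


def all_pole_filter(excitation, a):
    """Synthesis counterpart: s_n = u_n + sum_k a_k s_{n-k}."""
    speech = []
    for n, u_n in enumerate(excitation):
        past = sum(a[k - 1] * speech[n - k] for k in range(1, min(len(a), n) + 1))
        speech.append(u_n + past)
    return speech
```

For example, filtering an impulse train through `all_pole_filter` with the illustrative coefficients `[1.2, -0.6]` and then applying `inverse_filter` with the same coefficients returns the impulse train. With real speech the coefficients are only estimates, so the residue retains whatever the all-pole model fails to capture.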

In fact, various studies have demonstrated the usefulness of the
residue signal to detect laryngeal pathology (Koike and Markel, 1975;
Koike et al., 1977; Davis, 1976), which is a good indication that this
signal carries some information about the quality of the voice.
However, it is true that the residue signal in itself is not
physiologically present, and it is difficult to understand how it can be
related to laryngeal function. What we must consider, on the other [...]
Figure 2.3: Speech and residue signals of a normal speaker

Figure 2.4: Speech and residue signals of a pathologic speaker


[...] compute the pitch period in the case of vowels (Rabiner and Schafer
(1978) note that this is not always the case for sounds which are not
rich in harmonic structure), as will be demonstrated later in this study.
This is a very important factor since pitch perturbation measurements
are a dominant part of this study.

Hence, based on previous studies (Davis, 1976; Yanagihara, 1967;
Yumoto et al., 1982; Yumoto et al., 1984) and on the fact that measures
using the residue and speech signals are generally easy to implement, we
decided to include both of these signals for our study.


2.2  Acoustic  Measures 

In this section, several acoustic measures designed to evaluate
speech quality are presented. Two types of measures are described:
spectral domain measures, and time domain measures. A sub-type of the
latter, composed of pitch perturbation measures, is described
separately.

As mentioned earlier, the literature provides a wide range of [...]
voice qualities. [...]
goal was to implement a wide variety of measures, and to evaluate their
performance when applied to voices that lie at the [...]
quality continuum (i.e. high-quality voices and [...]).
Then, the measures that best discriminate between [...]
voices can be kept for further analysis.

The definition and range of all the measures [...]
presented in Tables 2.1 and 2.2.

DEFINITIONS

SFF      [...]
PA       [...]
HNR      [...]
PPQ      [...]
         [...] from a smoothed trend line
APQ      [...]
JITT     Average of cycle-to-cycle pitch perturbations
SHIMMER  Average of cycle-to-cycle amplitude perturbations (in dB)

Spectral Domain Measures


These measures are used to measure the whitening effect of the
inverse filter when applied to the speech spectrum. Gray and Markel
(1975) showed that maximizing the spectral flatness of the inverse
filter output is equivalent to minimizing the energy of the residue
signal. Hence, the more noise-like a signal will appear, the larger its
spectral flatness measure will be.

Spectral Flatness is defined as the ratio in decibels of the
geometric mean of the spectrum to the arithmetic mean of the spectrum,
having a maximum value of 0 dB for a constant spectrum.
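
The definition above translates directly into a few lines of code. The Python sketch below assumes that "the spectrum" is a strictly positive power spectrum, so that the decibel value is ten times the base-10 logarithm of the ratio; the direct DFT and the function names are hypothetical conveniences for illustration, not the analysis procedure of this study.

```python
import cmath
import math


def power_spectrum(signal):
    """Squared DFT magnitudes, computed directly (O(N^2), fine for a sketch)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal))) ** 2
        for k in range(n)
    ]


def spectral_flatness_db(spectrum):
    """10*log10(geometric mean / arithmetic mean) of a positive spectrum."""
    if any(value <= 0.0 for value in spectrum):
        raise ValueError("spectrum must be strictly positive")
    # The geometric mean is computed in the log domain to avoid overflow.
    log_mean = sum(math.log(value) for value in spectrum) / len(spectrum)
    arithmetic_mean = sum(spectrum) / len(spectrum)
    return 10.0 * (log_mean / math.log(10.0) - math.log10(arithmetic_mean))
```

A unit impulse has a constant spectrum and therefore a flatness of 0 dB, while a spectrum dominated by a single strong peak gives a large negative value, in line with the sign convention discussed in the text.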


[...] the z-transforms of the residue signal, the speech signal, and the
inverse filter coefficients), then SF(E) = SF(S) - SF(1/A), where SF(E)
and SF(1/A) are, respectively, the spectral flatnesses of the residue
signal (or SFR) and of the inverse filter (or SFF). Since SFF and SFR
are both negative numbers, a very large negative number corresponds to a
good voice, whereas a small negative number corresponds to a bad voice.

Physiologically, the spectral flatness detects the presence of
noise in the pathological speech (Davis, 1976). Davis also suggests
that the SFF (Spectral Flatness of the inverse Filter) measures the
masking of formant frequency amplitudes and bandwidths by noise, whereas
the SFR (Spectral Flatness of the Residue signal) measures the masking
of fundamental frequency harmonic amplitudes by noise. If the vocal
tract tube is considered constant and independent of the source
excitation, changes in the formant behavior may be attributed to changes
in the fundamental frequency harmonic amplitudes of the glottal volume
velocity waveform.

By defining the Open Quotient (OQ) of the volume velocity as

$$ OQ = \frac{\text{pulse duration}}{\text{cycle duration}} \tag{2.22} $$

we can relate the changes in spectral flatness to the changes in the OQ
since the OQ increases as the amplitudes of the higher harmonics of the
signal decrease (Davis, 1976). Hence, SFF and SFR are indicative of
variations in the periodic behavior of the glottal volume velocity
waveform and have a good chance to be related to pathological conditions [...]

2.2.2 Time Domain Measures

Coefficient of Excess. Koike and Markel (1975) and Davis (1976)
showed that the Signal-to-Noise Ratio (SNR) of the residue signal is
qualitatively correlated to the degree of severity of the disease.
However, for pathological residue signals, the pitch period is not
always distinct from the noise, and the Signal-to-Noise Ratio is almost
impossible to compute. Davis (1976) made the observation that
the amplitude distribution (obtained by partitioning the values of the
residue signal among equally spaced intervals and counting the percentage
of values in each interval) of a normal residue signal is more
concentrated around zero, whereas that of a pathological one is more
equally spread (see Figures 2.5 and 2.6).

This difference is quantified by a statistical measure called the
Coefficient of Excess (EX), defined as the ratio of the fourth moment of
the signal to the square of its second moment, minus 3. The taller
and narrower the distribution, the larger EX:

\[ \mathrm{EX} = \frac{E((x - \bar{x})^4)}{E((x - \bar{x})^2)^2} - 3, \qquad E((x - \bar{x})^n) = \frac{1}{N} \sum_{i} (x(i) - \bar{x})^n \]

Since EX is in general a positive number, a very large EX corresponds to
a high SNR, whereas a Coefficient of Excess near zero corresponds to a
very noisy residue signal.

The decrease in SNR of the residue signal in pathological signals
is not fully understood, since the results are complicated by the fact
that the residue represents the all-zero filtered output of a non-[...].
The noiselike behavior can be attributed to the breakdown of the
assumption of source-tract separability when incomplete glottal closure
occurs, as is the case in many pathological cases. However, one might
conjecture that as the turbulence and other noise at the level of the
glottis increase, and as the degree of glottal closure decreases, EX
decreases, so that EX is correlated to the SNR of the residue signal.

Figure 2.5: Amplitude distribution of a normal residue signal

Figure 2.6: Amplitude distribution of a pathological residue signal

Pitch Amplitude. The degree of voicing of a voiced sound is often
a good indicator of laryngeal quality. Such a degree is given by the
Pitch Amplitude (PA), which is the value of the amplitude of the pitch
period peak in the residue signal autocorrelation (in fact, since the
autocorrelation is normalized, its first peak at the origin is equal to
1, and the PA corresponds to the second peak). The autocorrelation of
the windowed residue signal is defined as

\[ r(k) = \sum_{i=0}^{N-1-k} e(i)\,h(i)\,e(i+k)\,h(i+k), \qquad k = 0, 1, \ldots, N-1 \qquad (2.26) \]

The values of PA will be between 0 and 1; a value near 1 indicates a
clear periodic voice, whereas a value near 0 is the sign of a "poor"
voice with respect to periodicity (Figure 2.7).
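A minimal sketch of the Pitch Amplitude computation follows, assuming a Hamming window, normalization by the value at lag zero, and a search restricted to a plausible lag range. The lag bounds, the function names, and the synthetic pulse train are assumptions made for illustration.

```python
import math

def hamming(n_points):
    """Hamming window h(n) = 0.54 - 0.46 cos(2*pi*n/N)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / n_points) for n in range(n_points)]

def normalized_autocorrelation(e):
    """Autocorrelation of the windowed signal, scaled so that r[0] == 1."""
    n = len(e)
    w = hamming(n)
    x = [e[i] * w[i] for i in range(n)]
    r = [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(n)]
    return [v / r[0] for v in r]  # assumes a non-zero signal

def pitch_amplitude(e, min_lag, max_lag):
    """Return (PA, lag) for the largest autocorrelation value in [min_lag, max_lag]."""
    r = normalized_autocorrelation(e)
    lag = max(range(min_lag, max_lag + 1), key=lambda k: r[k])
    return r[lag], lag

# Idealized residue: one unit pulse every 50 samples (200 Hz at 10 kHz sampling).
residue = [1.0 if i % 50 == 0 else 0.0 for i in range(400)]
pa, period = pitch_amplitude(residue, min_lag=20, max_lag=200)
```

For this perfectly periodic pulse train the peak falls at a lag of 50 samples and the PA is close to, but below, 1 because of the window taper.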

In his study, Schoentgen (1985) derived a similar peak from the
speech signal autocorrelation, and related it to the notion of Vocal
Efficiency. In normal voiced sounds, a strong periodicity in the
glottal volume velocity and the area waveforms, as well as symmetrical
vocal fold movements, will produce a high PA. For breathy sounds,
which are analogous at source level to the generation of unvoiced sounds
and voiced fricatives, the PA may be expected to be low.

Harmonics-to-Noise Ratio. The visual inspection of a spectrogram,
for a normal voice, shows well-developed harmonics as equally spaced
horizontal stripes, whereas that of a hoarse voice presents noise as a
cloudy shadow between them (Yanagihara, 1967). Based on this
observation, Yumoto et al. (1982) derived a new measure, the
Harmonics-to-Noise Ratio (HNR), which compares the acoustic energy of the
harmonic component of the signal with that of the noise component.

The following method is used to implement this measure. First, the
speech signal s(t) is divided into pitch periods s_i(t) (1 <= i <= n), each
defined to begin at one point after the preceding prominent peak.
In order to average consecutive periods, s_i(t) is assumed to be
equal to zero between the end of the duration of the ith period, T_i,
and T, the maximum of the T_i. The average wave is then

\[ s_A(t) = \sum_{i=1}^{n} s_i(t) / n \qquad (2.27) \]

The acoustic energy of the harmonic component is

\[ H = n \int_{0}^{T} s_A^2(t)\, dt \qquad (2.28) \]

and that of the noise component is

\[ N = \sum_{i=1}^{n} \int_{0}^{T_i} \left[ s_i(t) - s_A(t) \right]^2 dt \qquad (2.29) \]

The Harmonics-to-Noise Ratio is then defined as

\[ \mathrm{HNR} = 10 \log_{10} \frac{H}{N} \qquad (2.30) \]

The Harmonics-to-Noise Ratio is believed to be well correlated to
hoarseness (Yumoto et al., 1982). By its definition, it takes into
account the jitter and shimmer present in the signal, which is one of
its advantages, since jitter affects the spectrum of a sustained vowel
by reducing the amplitudes of the harmonics and adding noise between
them.

It should be noted that in previous studies, 50 cycles were used
for the computation of HNR. This point will be addressed in depth in
the next chapter.
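The signal-averaging idea can be sketched in a few lines. The version below follows the procedure as published by Yumoto et al. (1982) in discrete form: periods are zero-padded to the longest one, averaged, and the noise energy is accumulated over each period's own duration. The function name and the synthetic test periods are assumptions of this sketch.

```python
import math

def harmonics_to_noise_ratio(periods):
    """HNR in dB from a list of pitch periods (each a list of samples).

    Shorter periods are treated as zero beyond their own duration when averaging.
    Identical periods give zero noise energy and are not handled here.
    """
    n = len(periods)
    t_max = max(len(p) for p in periods)
    padded = [list(p) + [0.0] * (t_max - len(p)) for p in periods]
    average = [sum(p[t] for p in padded) / n for t in range(t_max)]
    harmonic_energy = n * sum(v * v for v in average)
    noise_energy = sum((p[t] - average[t]) ** 2
                       for p in periods for t in range(len(p)))
    return 10.0 * math.log10(harmonic_energy / noise_energy)

# Ten identical sine periods of 20 samples, with a tiny alternating
# disturbance on the first sample of each period.
base = [math.sin(2 * math.pi * t / 20) for t in range(20)]
cycles = []
for i in range(10):
    cycle = list(base)
    cycle[0] += 0.01 if i % 2 == 0 else -0.01
    cycles.append(cycle)
hnr_db = harmonics_to_noise_ratio(cycles)
```

In this toy case the harmonic energy is 100 and the noise energy is 0.001, so the result is about 50 dB.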

2.2.3 Perturbation Measures

For  all  the  following  measures,  the  residue  signal  was  used. 

Perturbation Quotients. Koike et al. (1977) indicate that steady
vowel sounds normally exhibit slow and relatively smooth changes in
pitch period and thus suggest that perturbations be measured from a
smoothed trend line to investigate rapid variations in periodicity.
This is why the concept of Perturbation Quotient (also called Relative
Average Perturbation) was introduced.

This quotient is defined as

\[ \mathrm{PQ} = \frac{\dfrac{1}{N-(k-1)} \displaystyle\sum_{i=1}^{N-(k-1)} \left| \dfrac{1}{k} \sum_{r=1}^{k} d(i+r-1) - d(i+m) \right|}{\dfrac{1}{N} \displaystyle\sum_{i=1}^{N} d(i)} \qquad (2.31) \]

where k is the length of the moving average, and m = (k-1)/2. Based on
Davis's study (1976), we take k = 5. To compute the Pitch Perturbation
Quotient (PPQ), d(i) (also called p(i) in that case) is equal to the
length of the ith cycle of the residue signal (in number of samples),
whereas for the Amplitude Perturbation Quotient (APQ), d(i) (also called
a(i) in that case) is equal to the peak amplitude of the ith cycle of the
residue signal (Figure 2.8). As noted in Figure 2.7, the residue signal
exhibits positive and negative peaks. The Perturbation Quotients are
first computed using the positive peaks, then using the negative peaks,
and the smallest result is kept (Davis, 1976). PPQ (or APQ) represents
the amount of "deviation" from the mean period (or mean amplitude).
Except where noted, the deviation will be expressed in percent.

Since the residue signal represents the speech signal with the
smoothing effects of the lip radiation, vocal tract and glottal shaping
removed, the perturbations resulting from the vocal fold action at the
level of the glottis should be more evident than when using the speech
signal.
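A short sketch of the Perturbation Quotient, written for a list d of cycle lengths (for PPQ) or peak amplitudes (for APQ), may make the moving-average idea concrete. It assumes an odd window length k, at least k cycles, and a result expressed in percent; the function name and test values are illustrative.

```python
def perturbation_quotient(d, k=5):
    """Mean absolute deviation of d from its k-point moving average, in percent of the mean of d.

    Assumes k is odd and len(d) >= k.
    """
    n = len(d)
    m = (k - 1) // 2                    # offset of the window centre
    count = n - (k - 1)                 # number of complete windows
    deviation = sum(abs(sum(d[i:i + k]) / k - d[i + m]) for i in range(count)) / count
    return 100.0 * deviation / (sum(d) / n)

# Cycle lengths (in samples) alternating between 100 and 102.
cycle_lengths = [100.0, 102.0] * 5
ppq = perturbation_quotient(cycle_lengths)
steady = perturbation_quotient([100.0] * 10)
```

Because the moving average follows slow drifts, a smooth trend in the cycle lengths contributes little, while cycle-to-cycle alternation such as the one above is captured.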


Mean Jitter and Shimmer. They are defined as follows:

\[ \text{Mean Jitter} = \frac{1}{N-1} \sum_{i=1}^{N-1} \left| P_i - P_{i+1} \right| \qquad (2.32) \]

\[ \text{Shimmer (dB)} = \frac{20}{N-1} \sum_{i=1}^{N-1} \left| \log_{10} \frac{A_i}{A_{i+1}} \right| \qquad (2.33) \]

where P_i is the period of the ith cycle, A_i is the peak amplitude of the
ith cycle, and N is the number of consecutive cycles analyzed. These
measures were introduced by Horii (1979 and 1980).
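These two definitions translate directly into code. The sketch below assumes the periods and the peak amplitudes are already available as lists, with strictly positive amplitudes; the function names are illustrative.

```python
import math

def mean_jitter(periods):
    """Average absolute difference between consecutive cycle periods."""
    n = len(periods)
    return sum(abs(periods[i] - periods[i + 1]) for i in range(n - 1)) / (n - 1)

def shimmer_db(amplitudes):
    """Average absolute amplitude ratio between consecutive cycles, in dB."""
    n = len(amplitudes)
    total = sum(abs(math.log10(amplitudes[i] / amplitudes[i + 1])) for i in range(n - 1))
    return 20.0 * total / (n - 1)

jitter_ms = mean_jitter([10.0, 10.2, 10.0, 10.2])   # periods in ms
shimmer = shimmer_db([1.0, 10.0, 1.0])              # a tenfold swing is 20 dB
```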

Percent Jitter. This factor may be derived by dividing the mean
jitter by the mean period (both in ms) and multiplying by one hundred.
It is generally believed that the percent jitter for a normal patient
should be below 2%, and that of a pathologic speaker should be above 8%;
hence, the range 2%-8% is open to interpretation.

Figure 2.8: Pitch periods and locations used for the computation
of perturbation quotients

Directional Perturbation Factor (DPF). This parameter was
introduced by Hecker and Kreul (1971), and later discussed by Sorensen
and Horii (1984), as well as Askenfelt and Hammarberg (1986). It is
defined as the percentage of the total number of differences between
adjacent cycle durations for which there is a change in algebraic sign.

Pitch Perturbation Factor (PPF). Introduced by Lieberman (1961),
it is defined as the percentage of differences between the duration of
adjacent cycles that exceed 0.3 ms.
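The three percentage measures above can be sketched as follows. The treatment of zero differences in the sign-change count and the use of a strict "greater than" comparison against the threshold are assumptions of this sketch, as are the function names.

```python
def percent_jitter(periods_ms):
    """Mean jitter divided by the mean period, times one hundred."""
    n = len(periods_ms)
    jitter = sum(abs(periods_ms[i] - periods_ms[i + 1]) for i in range(n - 1)) / (n - 1)
    return 100.0 * jitter / (sum(periods_ms) / n)

def directional_perturbation_factor(periods_ms):
    """Percentage of adjacent-cycle differences at which the algebraic sign changes.

    A zero difference is never counted as a sign change (assumption).
    """
    diffs = [b - a for a, b in zip(periods_ms, periods_ms[1:])]
    changes = sum(1 for u, v in zip(diffs, diffs[1:]) if u * v < 0)
    return 100.0 * changes / len(diffs)

def pitch_perturbation_factor(periods_ms, threshold_ms=0.3):
    """Percentage of adjacent-cycle differences larger than threshold_ms."""
    diffs = [abs(b - a) for a, b in zip(periods_ms, periods_ms[1:])]
    return 100.0 * sum(1 for v in diffs if v > threshold_ms) / len(diffs)

alternating = [8.0, 8.5, 8.0, 8.5, 8.0]      # periods in ms
pj = percent_jitter(alternating)
dpf = directional_perturbation_factor(alternating)
ppf = pitch_perturbation_factor(alternating)
```

For the alternating sequence, three of the four differences follow a sign change (DPF of 75%), and every difference exceeds the threshold (PPF of 100%).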


2.3 Pitch Measurement Scheme

Since about half of the acoustic measures used in this study are
perturbation measures, it [...]. For isolated steady vowels, the pitch
period is usually fairly e[...]. [...] (for which a complete description
is given in Chapter 3) is composed of pathologic speakers, whose voices
can be very deviant. When such is the case, the periodicity of the
residue signal is not always apparent (see Figure 2.4). It is therefore
necessary to [...]


As mentioned earlier, the autocorrelation of the residue signal
exhibits peaks spaced at the pitch period (Figure 2.7a). Therefore a
peak detector for the autocorrelation sequence (excluding the origin) is
used to find the largest peak, whose amplitude is presumed to correspond
to the Pitch Amplitude; the distance from this peak to the origin gives
the pitch period. However, in some cases, it is observed that a peak in
the autocorrelation that corresponds to a subharmonic of the fundamental
frequency is higher than the true pitch period peak (Figure 2.7b), which
yields an incorrect pitch period. To circumvent this problem, an
algorithm has been implemented (this is a modified version of the
algorithm proposed by Davis, 1976): the first six peaks excluding the
origin are found, and the peak nearest the origin is chosen as the pitch
period peak.
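One possible reading of this rule is sketched below: the local maxima of the autocorrelation are collected, the six largest are retained, and the one at the smallest lag is taken as the pitch period peak. Interpreting "the first six peaks" as the six largest is an assumption of this sketch, and the function name and toy sequence are hypothetical.

```python
def pitch_period_from_autocorrelation(r, n_peaks=6):
    """Return (lag, amplitude) of the retained peak nearest the origin.

    r is a normalized autocorrelation with r[0] at the origin; the origin
    itself is never a candidate because the scan starts at lag 1.
    """
    peaks = [k for k in range(1, len(r) - 1) if r[k - 1] < r[k] >= r[k + 1]]
    if not peaks:
        raise ValueError("no peak found in the autocorrelation")
    largest = sorted(peaks, key=lambda k: r[k], reverse=True)[:n_peaks]
    lag = min(largest)
    return lag, r[lag]

# Toy autocorrelation in which the peak at twice the period (lag 100)
# is higher than the true pitch period peak (lag 50).
r = [0.0] * 200
r[0], r[50], r[100], r[150] = 1.0, 0.6, 0.8, 0.5
lag, amplitude = pitch_period_from_autocorrelation(r)
```

Picking the global maximum here would give a lag of 100 samples; the rule returns 50 samples instead, together with the corresponding Pitch Amplitude.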

2.3.2  Robustness 

In order to evaluate the robustness of our pitch detector, we used
the Differentiated Electroglottogram (DEGG) as a reference. The
Electroglottogram (EGG) is the time-varying signal that reflects the
contact between the vocal folds (Childers and Larar, 1984); the
amplitude variations of this signal are thought to be representative of
the amount of contact between the vocal folds. Figure 2.9 presents an
example of EGG, with the corresponding DEGG, for a normal speaker. In
order to evaluate the pitch period from the DEGG signal, the following
algorithm was used: a stable portion of the DEGG was displayed, the
number of negative peaks was visually evaluated, and a search was made
for those peaks using a peak detection program; parabolic interpolation
(see Section 3.2) was used to increase the accuracy. The pitch period
was then evaluated by computing the average of the differences between
consecutive peak locations.

The comparison between the pitch period estimated from the DEGG and
that estimated from the residue signal can be found in Tables 2.3 and
2.4. For this particular study, a subset of the original data base was
used; it consisted of the highest quality voices for the normal data
base and the most deviant ones for the pathologic data base. The voices
were selected based upon the judgment of a speech pathologist. This
subset was composed of 25 normal voices (11 male voices and 14 female
voices) and 13 pathologic voices.

In 91% of the cases, the fundamental frequency was smaller when
estimated from the DEGG than from the residue signal. However, in all
cases, the difference between the two estimates was within 2 samples, or
0.2 ms (since a sampling frequency of 10 kHz was used). If the DEGG is
considered as a reference, we can therefore conclude that our pitch
detector is very reliable.

Even though the autocorrelation of the residue signal can be
difficult to analyze (see Figure 2.7c), the pitch period was
consistently detected correctly, and the detector worked well for
high-pitched speakers. One source of error comes from the fluctuations
of the pitch period for some patients. In this case, the choice of the
portion of speech analyzed is crucial, since most acoustic measures will
at some point in the analysis depend on the pitch period obtained.

From Table 2.4, we can see that the pitch period was not
possible to obtain for pathologic speakers using the DEGG. Figure 2.10
shows examples of DEGG signals. The first signal can be used easily,
whereas the second and third ones show a changing periodicity, or no
periodicity at all.

Figure 2.10: Examples of pathological DEGG signals

Hence, we can conclude that for isolated steady vowels, using the
autocorrelation of the residue signal is a reliable way to find the pitch
period.

2.4 Relations between Perturbation Measures

In order to study the relation between the various perturbation
measures, it was decided to use a general-purpose formant synthesizer,
which falls into the analysis-synthesis category. This synthesizer was
originally designed by Klatt (1980), and later adapted and modified for
our laboratory by Pinto (1987).

The advantage of this synthesizer with respect to our study is its
flexibility concerning the excitation used; four sources of excitation
are available:

- LF Model (glottal volume velocity parameters)

- Fant Circuit Model (glottal area parameters)

- an arbitrary volume velocity waveshape

- an arbitrary glottal area function

These sources will be described more thoroughly in the next chapter.

It has been demonstrated previously (Pinto, 1987) that various
voice qualities such as breathiness or roughness can be simulated with
fixed vocal tract parameters by varying only the excitation parameters.

Before applying the various acoustic measures to our data base, it
seemed appropriate to investigate the relations between some of these
measures. In order to do so, we synthesized the vowels /i/ and /a/
using fixed formant tracks, at two different fundamental frequencies:
130 Hz and 250 Hz. One file was created for each vowel and each
frequency. For each file, we then introduced some random noise in the
fundamental frequency track; the amount of random noise was
characterized by its variance. A variance of [...] corresponds to a
small amount of noise and is barely detectable when listening to the
synthesized voice, whereas a variance of 30% corresponds to a very
breathy voice. By introducing some jitter in the excitation, we also
introduced some jitter in the speech signal and, hence, in the residue
signal. The goal was then to study the effects of this increase in
jitter on the values of shimmer and Harmonics-to-Noise Ratio in the
residue signal. The amount of jitter in this signal is reflected by the
values of PPQ and %JIT. We then repeated a similar experiment, but this
time the amount of shimmer in the source was varied. By increasing the
amount of shimmer in the source, we also increased the amount of shimmer
in the residue signal (which is reflected by an increase of the APQ),
and then studied the corresponding changes in PPQ and HNR.
Hillenbrand (1987) described a comparable study.

The results pertaining to this experiment are summarized in Tables
2.5 to 2.16 and in Figures 2.11 to 2.14. As one can see from the
results, there is a very strong correlation between the various
perturbation measures and the Harmonics-to-Noise Ratio. As suggested by
Hillenbrand (1987), HNR values for stimuli with a given jitter value
were consistently lower for stimuli with higher fundamental frequencies.


Correlation

          APQ      HNR      %JIT
PPQ      0.959   -0.953    0.993

Table 2.8: Correlation between various measures for /i/ and F0 = 250 Hz

Effects of introducing random noise in the [...]
when synthesizing vowel /a/ for F0 = 130 Hz

Table 2.11: Correlation between measures for vowel /a/ and F0 = 130 Hz

PPQ      0.838   -0.899    0.999
APQ              -0.889    0.880

Figure: Variations of PPQ and HNR [...]
We also observe that PPQ and APQ are strongly correlated, which makes
the changes in HNR observed in Figures 2.11b and 2.12b more difficult
to interpret. In fact, the sensitivity of HNR measurements to changes
in pitch and amplitude measurements was suggested by Yumoto et al.
(1982) when they introduced their new signal-averaging technique; they
claimed to have "accounted for this departure from the ideal conditions
[i.e., no pitch perturbation] by assuming that (the signal) is equal to
zero in the interval between T(i), the duration of the ith period, and
T, the maximum of the T(i)" (p. 1545). Our results, as well as those of
Hillenbrand, suggest that this procedure does not fully account for
variations in jitter. However, before interpreting these results
further, we must compare them with what happens for natural voices,
which will be done in the next chapter.

Two reasons can explain the increase in shimmer with the increase
in jitter. First, the intensity of a given pitch pulse is determined
not only by the intensity of the glottal source waveform, but also by
the relationship between harmonics of the glottal source and the
location of resonances in the vocal tract transfer function. These
relationships will become more variable as the pitch perturbation
increases, therefore increasing the amount of cycle-to-cycle variability
in output intensity. Second, an overlap in energy occurs between
adjacent pitch pulses. When a train of glottal pulses is generated, it
is generally the case that a given glottal pulse will be generated
before the previous pulse is fully damped (Hillenbrand, 1987). Hence,
energy from the "tail" of a given pitch pulse will overlap with energy
from the beginning of the next pitch pulse, and the amount of overlap
will vary with the pitch perturbation. The shorter the pitch period
(that is, the higher the fundamental frequency), the greater the
amplitude at the end of the pitch pulse, and therefore more energy will
be added to the next pitch pulse. Hence, we would expect that the APQ
values would be higher for high fundamental frequencies. This is not
clearly illustrated in Figures 2.11a and [...], which may be due to the
relatively low sampling frequency used in this experiment (10 kHz).


As for the second experiment, there appears to be a change in
PPQ and HNR when APQ is increased, but the variations are not as large
as for the first experiment. Hence, we can conclude that an increase in
shimmer in the speech signal (or the residue signal) [...]
significant influence on the PPQ and HNR.


After reviewing the LPC analysis method, this chapter presented the
various acoustic measures used in this study (12 measures were
implemented). Then, using a formant synthesizer, the correlations
between the main perturbation measures (Pitch Perturbation Quotient,
Amplitude Perturbation Quotient, Percent Jitter) and the
Harmonics-to-Noise Ratio were studied. There appeared to be a very strong
correlation between these measures; the degree of correlation also
appeared to be related to the fundamental frequency.

The next chapter presents the voice data base used (normal and
pathologic voices), as well as the results of the various acoustic
measures applied to a subset of this data base. The conditions of
analysis are studied, as are the relations between the measures.


CHAPTER 3
DATA COLLECTION

The first section of this chapter presents the data base used in
this study. The acoustic measures are then applied to this data base,
using conditions of analysis that had been previously described in the
literature. The influence of these conditions of analysis is then
studied in depth, and we use optimal conditions to test our measures on
a reduced data base consisting of high-quality and deviant voices.

Using the formant synthesizer described in the previous chapter, we
can test the effects of using different glottal sources and of changing
the values of the formants on our acoustic measures, which gives us some
insight into the quality of synthetic speech.


The data base of speakers with no vocal disorders (normal)
consisted of 27 normal male speakers with an average age of 36, ranging
from 21 to [...], and 25 normal female speakers, with an average age of
28, ranging from 20 to 53. None of these speakers had a history of
laryngeal disease. Each subject was asked to phonate successively the
vowels /i/, /o/, and /a/ during about 2 s for each vowel at a
comfortable loudness and pitch. The speech and electroglottographic
signals (described later in this chapter) were recorded with a Bruel and
Kjaer Type 2619 condenser microphone and a Synchrovoice laryngograph,
and digitized via a Digital Sound Corporation (DSC) 200 A/D system onto
a hard disk for further processing on a VAX 11/750 computer. The speech
signal was also recorded onto studio quality 3M AVX 60 cassette tape
using a SONY TC-PX520B cassette recording deck. All recordings were
made in a sound-treated booth. Both the speech and EGG waveforms were
compensated for phase distortion introduced by the audio tape recorders.

The pathologic data base consisted of 23 voices of patients with
vocal disorders (some of the patients came before and after treatment)
and seven voices that mimicked several vocal disorders. A full
description can be found in Table 3.1. The range of voices varied from
mildly deviant to very deviant. The patients were asked to phonate the
vowel /i/ for about 2 s.


3.2 Program Description

A block diagram of the Fortran program implemented to compute the
various acoustic measures can be found in Figure 3.1. The segment of
the speech signal analyzed is first windowed using a Hamming window,
whose equation is given by

\[ h(n) = 0.54 - 0.46 \cos(2\pi n / N), \qquad n = 0, 1, \ldots, N-1 \qquad (3.1) \]

This window allows for a better spectral match between the reciprocal of
the inverse filter spectrum and the envelope of the speech spectrum than
a rectangular window.

It is known that a steady vowel has a -6 dB/octave spectral slope
(Rabiner and Schafer, 1978); this is due to the fact that the glottal
shaping spectrum has an approximate -12 dB/octave slope, and the lip
[...]

Length of interval of analysis: 12[...] ms

Concerning the latter point, Davis (1976) argues that there is no need
to update the filter coefficients every 20 ms for isolated steady
vowels; however, this point will be addressed in depth later in this
chapter. Hence, at this point of the study, we were not concerned with
the number of cycles analyzed, but decided to consider a fixed duration
length, regardless of the fundamental frequency.

After computing the residue signal, its autocorrelation was
computed and the distance from the origin to the largest peak was taken
as the pitch period (see Section 3.6 for more details). The amplitude
of this largest peak is equal to the value of the Pitch Amplitude.
Knowing the pitch period, both the positive peaks and the negative peaks
of the residue signal were located as follows: the starting point was
taken as the largest positive or negative peak; then a search was made
for the previous positive or negative peak in an interval centered at a
point one pitch period removed from the first peak. The size of this
interval was equal to 2RP, where P is the pitch period and R = 0.20. This
procedure was repeated until the end of the sequence. Parabolic
interpolation was used to improve the accuracy of the location of the
peaks and their amplitude. The procedure for such an interpolation is
as follows (Markel and Gray, 1976): a parabola has the form

\[ y(\lambda) = a\lambda^2 + b\lambda + c \qquad (3.3) \]

If y(0) defines a discrete peak value, and y(-1) and y(1) define the
sample points to the left and right of y(0), then the parabola passing
through these three points has coefficients

\[ c = y(0) \]

\[ b = [y(1) - y(-1)] / 2 \]

\[ a = [y(-1) + y(1)] / 2 - y(0) \]

By solving dy(\lambda)/d\lambda = 0, the peak location is then
\lambda = -b/(2a), where b and a are defined above. The corrected peak
amplitude is

\[ y(-b/2a) = -\frac{b^2}{4a} + y(0) \]
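The interpolation step is compact enough to show as a sketch. The function below takes the three samples around a discrete peak and returns the fractional offset of the refined peak together with its corrected amplitude; the function name and the guard for the degenerate flat case are assumptions added for illustration.

```python
def parabolic_peak(y_prev, y_peak, y_next):
    """Refine a discrete peak with a parabola through three neighbouring samples.

    Returns (offset, amplitude), where offset is measured in samples from the
    discrete peak position (negative means toward the previous sample).
    """
    c = y_peak
    b = (y_next - y_prev) / 2.0
    a = (y_prev + y_next) / 2.0 - y_peak
    if a == 0.0:                       # three collinear samples: nothing to refine
        return 0.0, y_peak
    offset = -b / (2.0 * a)
    amplitude = c - b * b / (4.0 * a)
    return offset, amplitude

# Samples of y = 2 - (x - 0.25)^2 taken at x = -1, 0, 1.
offset, amplitude = parabolic_peak(0.4375, 1.9375, 1.4375)
```

Since the test samples lie exactly on a parabola, the refinement recovers its true vertex: an offset of 0.25 samples and an amplitude of 2.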


Peak picking and parabolic interpolation are also performed on the
speech signal itself, in order to compute the Harmonics-to-Noise Ratio.

Once all the peaks of the residue signal have been found, the Pitch
Perturbation Quotient and the Amplitude Perturbation Quotient are
computed, along with the mean jitter, percent jitter, shimmer,
directional perturbation factor, and pitch perturbation factor.

The spectral flatness measures and the Coefficient of Excess are
derived from the speech and residue signals, and the Harmonics-to-Noise
Ratio is derived from the speech signal alone.


3.3 Testing of the Measures

In order to test the measures used by Davis (1979) and Prosek et
al. (1987), along with the Harmonics-to-Noise Ratio, we used the first
seven measures described in the previous chapter (i.e., PPQ, APQ, SFF,
SFR, EX, PA, and HNR). We evaluated [...] the significance of the
difference [...]. The results obtained for male and female speakers for
the three vowels are presented in Tables 3.2 to 3.5. In order to
interpret the various [...]

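A statistical Student t-test (Bruning and Kintz, 1968) was
performed to evaluate the differences between the various means. This
test determines whether the difference between the means of
two groups of subjects is significant; its formula is given by

\[ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\left[ \dfrac{\sum X_1^2 - \dfrac{(\sum X_1)^2}{N_1} + \sum X_2^2 - \dfrac{(\sum X_2)^2}{N_2}}{(N_1 + N_2) - 2} \right] \left[ \dfrac{1}{N_1} + \dfrac{1}{N_2} \right]}} \]

where X_1 corresponds to the first group of results,
X_2 corresponds to the second group of results,
N_1 is the number of data in the first group, and
N_2 is the number of data in the second group.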
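For reference, the pooled-variance t statistic for two independent groups can be computed as in the sketch below. It assumes the standard form of the test with N1 + N2 - 2 degrees of freedom; the function name and the sample values are illustrative, and looking up the significance level in a t table is left out.

```python
import math

def student_t(x1, x2):
    """Pooled-variance t statistic for the difference between two group means."""
    n1, n2 = len(x1), len(x2)
    mean1, mean2 = sum(x1) / n1, sum(x2) / n2
    # Sums of squared deviations, written in the computational (raw-score) form.
    ss1 = sum(v * v for v in x1) - sum(x1) ** 2 / n1
    ss2 = sum(v * v for v in x2) - sum(x2) ** 2 / n2
    pooled_variance = (ss1 + ss2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_variance * (1.0 / n1 + 1.0 / n2))

t_value = student_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

Here both groups have a sum of squared deviations of 2, so the pooled variance is 1 and the statistic is about -3.67 with 4 degrees of freedom.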


[...] obtained by Davis (1976), except for the SFF, whose absolute value
is much larger in our case, and EX, which is also much larger in our
case. These differences may be attributed to the different sampling
frequencies (6500 Hz for Davis, [...]) and filter orders ([...]
for Davis, 14 in our case). In order to verify this point, we repeated
the measurements for two normal subjects (one male and one female) using
a filter order of 6 and a sampling frequency of 5 kHz. For the male
speaker, the SFF value, previously equal to 11.947 (in absolute value),
became equal to 5.843, while the EX changed from 19.672 to 3.835. For
the female speaker, the SFF value went from -11.433 to -7.747, while the
EX value went from 3.218 to 1.067. This confirms the importance of the
sampling frequency used in such a study. The HNR values are similar in
range to those obtained by Yumoto et al. (1982).

For each of the three vowels, the differences between the results
obtained for male speakers and female speakers are quite significant.
The parameters most affected by the gender are the Spectral Flatness of
the residue signal (larger in absolute value for females) and the
Coefficient of Excess (larger for male speakers); these two parameters
show a significant difference between males and females in the case of
all 3 vowels, and might be used to recognize gender. However, more
studies are needed to assess the reliability of these acoustic measures
in gender recognition tasks.

As for the use of different vowels in studies of voice quality,
there seems to be no obvious trend (which is consistent with the results
obtained by Koike et al., 1977) except for one important factor: the PPQ
is not significantly affected by the choice of the vowel used. Since
any single study uses the same vowel for all speakers, the effect of
vowel type must be uniformly distributed among speakers, and is therefore
not a significant source of error.
NS Not Significant

S Significant (confidence level in parentheses)


We can conclude that male and female speakers should be treated
separately, since most results are significantly different for the two
genders. The choice of the vowel to be used in the rest of our study
does not appear to be extremely important, and it was decided to use the
vowel /i/.

3.4 Relations between Various Perturbation Measures

As described above, seven different perturbation measures have been
implemented. Some of these measures are theoretically related, and it
seemed interesting to compute the coefficient of correlation between
these various measures. The seven perturbation measures (PPQ, PPF, Mean
Jitter, Percent Jitter, DPF, APQ, and shimmer) were applied to the
normal data base, and the coefficient of correlation was calculated
between each of the five jitter measures (PPQ, PPF, mean jitter, percent
jitter, [...] shimmer). The coefficient used is the [...]
coefficient of correlation (Bruning and Kintz, 1968). The results are
presented in Tables 3.6 to 3.8.
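A correlation analysis of this kind can be sketched with the product-moment (Pearson) coefficient. The choice of this particular coefficient is an assumption of the sketch, and the function name and the two short measure lists are hypothetical values used only to show the computation.

```python
import math

def pearson_r(x, y):
    """Product-moment coefficient of correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical per-speaker values of two measures that are linearly related.
measure_a = [0.4, 0.6, 0.9, 1.3, 2.0]
measure_b = [2.0 * v + 1.0 for v in measure_a]
r_ab = pearson_r(measure_a, measure_b)
```

Two measures that differ only by scaling and offset give a coefficient of 1, which is why measures derived from one another carry little independent information.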

It  appears  chat  the  Kean  Jitter  and  the  Percent  Jitter  are  highly 
correlated  for  both  pales  and  feoales,  which  was  expected  fcop  their 
definitions  (Percent  Jitter  is  equal  to  Kean  Jitter  divided  by  the  pesn 
period).  The  percentage  of  jitter  in  fepale  voices  le  hlger  chan  for 
pale  voices,  which  is  due  to  the  fact  that  female  voices  are  pore 
breathy  and  often  exhibit  Incopplete  glottal  closure.  An  incopplete 
closure  generates  soae  air  escapage,  and  hence  some  noise  in  the 
fundapencal  frequency  track. 


The Pitch Perturbation Quotient (PPQ) correlates well with most of
the other jitter measures, except for the Directional Perturbation
Factor (DPF). The Amplitude Perturbation Quotient (APQ) is highly
correlated to the shimmer factor for female voices; since the
coefficient of correlation is also relatively high when male and female
speakers are combined, this shimmer factor introduced by Horii (1980)
does not appear to bring anything new with respect to the APQ. The
Directional Perturbation Factor does not correlate well with any other
measure.

Given the results of this correlation analysis, only four of the
perturbation measures were kept for the rest of the study: PPQ, APQ,
Percent Jitter, and DPF. The other measures appeared strongly
correlated to these four, and hence could not bring any new
information.

Hence, only nine measures were kept for the remainder of the study:
PPQ, APQ, SFF, SFR, EX, PA, HNR, Percent Jitter, and DPF.

3.5 Comparison Residue Signal - DEGG

Since the EGG reflects the behavior of the vocal folds, it seemed
interesting to compare the perturbation measures obtained by using the
residue signal and the same measures obtained by using the DEGG.

3.5.1 Algorithm Used

First, the most stable portion of the DEGG signal was chosen (this
is especially important for pathologic speakers whose DEGG can present
irregular behavior).

The method used to get the various DEGG measures was the following:
we first estimated the pitch period by using the autocorrelation of the
residue signal. The DEGG was then scanned to find the largest negative
peak. Using this point as a start, a search was made for the previous
negative peak in an interval centered at a point one pitch period (P
points) removed from the first peak. The size of this interval was
[...] previous pitch period peak was found, and the process was
continued until the start of the sequence was reached. Then the location
procedure was repeated from the initial peak to the end of the sequence.
Parabolic interpolation was used on all peaks for greater accuracy in
determining consecutive pitch period durations and pitch period
amplitudes. Once all the negative peaks of the DEGG had been found in a
given segment of the DEGG, the various measures could be computed using
the same definitions as when using the residue signal.
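
The peak location procedure described above can be sketched as follows.
This is only an illustration under stated assumptions: the half-width of
the search interval defaults to a quarter of the pitch period (the
actual size is not given here), the pitch period P is supplied in
samples by the caller, and all function names are hypothetical.

```python
def parabolic_refine(signal, i):
    """Fit a parabola through samples i-1, i, i+1.

    Returns the interpolated (position, amplitude) of its vertex.
    """
    if i <= 0 or i >= len(signal) - 1:
        return float(i), float(signal[i])
    y0, y1, y2 = signal[i - 1], signal[i], signal[i + 1]
    curvature = y0 - 2.0 * y1 + y2
    if curvature == 0:
        return float(i), float(y1)
    offset = 0.5 * (y0 - y2) / curvature
    return i + offset, y1 - 0.25 * (y0 - y2) * offset


def lowest_sample(signal, center, half_width):
    """Index of the most negative sample in the window around center."""
    lo = max(0, center - half_width)
    hi = min(len(signal) - 1, center + half_width)
    return min(range(lo, hi + 1), key=lambda k: signal[k])


def negative_peaks(degg, period, half_width=None):
    """Locate one negative DEGG peak per cycle.

    Starts from the largest negative peak, walks back to the start of
    the sequence, then forward to its end, and refines every peak by
    parabolic interpolation.
    """
    if period < 2:
        raise ValueError("period must be at least 2 samples")
    if half_width is None:
        half_width = max(1, period // 4)  # assumed interval size
    if not 0 < half_width < period:
        raise ValueError("half_width must be in (0, period)")
    start = min(range(len(degg)), key=lambda k: degg[k])
    peaks = [start]
    current = start
    while current - period >= 0:            # backward search
        current = lowest_sample(degg, current - period, half_width)
        peaks.insert(0, current)
    current = start
    while current + period <= len(degg) - 1:  # forward search
        current = lowest_sample(degg, current + period, half_width)
        peaks.append(current)
    return [parabolic_refine(degg, i) for i in peaks]
```

The differences between consecutive interpolated positions give the
pitch period durations, and the interpolated amplitudes give the pitch
period amplitudes used by the perturbation measures.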


3.5.2 Results

Of the seven perturbation measures used at the beginning of this
study, four could be computed using the differentiated EGG (PPQ, APQ,
Percent Jitter, and DPF). We once again used our high-quality and very
deviant data bases, and analyzed the voices using the DEGG. The
comparisons of the means of the various measures obtained by using the
residue signal and the DEGG are presented in Tables 3.9 and 3.10. We
[...] voices), and the results are presented in Table 3.11.

By examining the data, we can see that the perturbation measures
obtained using the DEGG are much smaller than those obtained using the
residue signal, which is in agreement with the data presented by
Horiguchi et al. (1987), who compared perturbation measures obtained
from the speech and EGG signals. This could be due to the highly
regular pattern of the DEGG in most cases. Contrary to the residue
signal, the DEGG is not subject to any modeling error, and therefore
might give a better image of reality. As noted by Haji et al. (1986),
the amplitude perturbation of the EGG of pathologic voices is larger
than that of normal voices. However, as mentioned earlier, the DEGG
signal might be impossible to use when there is no clear periodicity,
or when weak secondary peaks occur, as is sometimes the case if the
speech is diplophonic (see Figure 3.1 for examples of pathological
DEGG). Hence, it was decided that the residue signal might be better
suited to this study, since it yields very good results for pitch
detection, and does not appear as difficult to analyze as the DEGG in
the case of very deviant voices.


3.6 Conditions of Analysis and Results

As mentioned earlier, all the previous results were obtained by
computing the inverse filter coefficients only once for a 128 ms period.
However, when deviant voices have to be analyzed, the interval of
analysis is a very important factor. Our data show that pathologic
voices can exhibit a very normal behavior over a period of time, and
suddenly become quite abnormal. This can only be observed if a long
interval is analyzed. This fact raises two points:

- an interval of analysis of 128 ms may be too short in some cases.


Table 3.9 (columns: PPQ (%), APQ (%), %JIT, DPF)

Table 3.10: Comparison of the means of 4 perturbation measures
obtained by using the residue signal and the DEGG

PPQ (%)

APQ (%)

%JIT

DPF

6-SB 

(2.09) 

17.13 

(3.84) 

11.288 

(3.376) 

73.518 

(7.736) 

(1.53) 

(5.41) 

(3-340) 

73.437 

(16.636) 

Table 3.11 (columns: PPQ, APQ, %JIT, DPF)

- the filter coefficients [...]

3.6.1 Number of Cycles

In order to study the influence of these two factors, we decided to
update the inverse filter coefficients every frame (i.e. every 25.6 ms)
and to vary the number of cycles analyzed. For this study, five
speakers were used: two normal males, two normal females, and one
pathologic speaker. A priori, four of the acoustic measures used seemed
possibly sensitive to the number of cycles analyzed: PPQ, APQ, HNR, and
Percent Jitter.

Concerning the HNR, Yumoto et al. (1982) contend that [...] be
analyzed. Figure 3.2 shows the variations in HNR [...] cycles is varied
for two speakers. Even though these [...] a very different pattern, the
value for 50 cycles appears to be close to the [...] combining the
values for all the numbers of cycles. Hence, 50 cycles seems [...]
number to compute the HNR.

Similarly, Figure 3.3 shows the variations in [...] Jitter when the
number of cycles is varied; these [...] just examples, and in fact, this
procedure has been repeated for many more speakers; there appears to be
no general pattern, and these variations are strongly
speaker-dependent. However, as is the case for HNR, the value obtained
for 50 cycles is almost always very close to the mean of all the values
combined. Hence, we selected 50 cycles as the duration of data to be
analyzed, whereas Titze et al. (1987) suggest a number from 20 to 30
cycles, but with several replications of the same utterance. The filter
coefficients were updated every frame (with no overlap).


Figure 3.2: Variations of HNR versus number of cycles analyzed

3.6.2 Results

At this point, we decided to compare the results obtained with and
without updating the inverse filter coefficients every frame when the
length of data analyzed was equal to 128 ms. For this particular study,
the reduced data base composed of high-quality and very deviant voices
was once again used. The male and female voices were not treated
separately in this case, since this reduced data base is mainly used for
a global estimation of the most efficient measures for the
discrimination between voices of very different qualities.

Tables 3.12 and 3.13 present the results of our nine measures when
applied to this subset of the data base. These results were obtained
without updating the filter coefficients, and a fixed length of the
speech signal was analyzed, regardless of the fundamental frequency.
This means that the [...]

[...] and 3.15 present the results when the filter coefficients were
updated every frame, and 50 cycles of data were analyzed for each
subject. These two tables also include the correlation matrix for the
nine measures.

The first observation is that the two methods yield very similar
means, which tends to confirm the fact that for isolated steady vowels,
the coefficients of the inverse filter do not vary very much from frame
to frame. This justifies the hypothesis made by Davis (1976) that there
is no need to update the filter coefficients when intervals up to 128 ms
are analyzed. However, the analysis of 50 cycles requires a much longer
interval than 128 ms in the case of low-pitched voices and, therefore,
the filter coefficients need to be updated every frame (but no overlap
is necessary).
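
The length of signal needed for 50 cycles follows directly from the
fundamental frequency. The sketch below is a simple worked example; the
function name is hypothetical, and the 128 ms figure is the fixed
interval mentioned in the text.

```python
def analysis_duration_ms(f0_hz, n_cycles=50):
    """Duration, in milliseconds, of n_cycles pitch periods at f0_hz."""
    if f0_hz <= 0:
        raise ValueError("fundamental frequency must be positive")
    return 1000.0 * n_cycles / f0_hz


def exceeds_fixed_interval(f0_hz, n_cycles=50, interval_ms=128.0):
    """True when n_cycles at f0_hz do not fit in the fixed interval."""
    return analysis_duration_ms(f0_hz, n_cycles) > interval_ms
```

For a fundamental frequency of 100 Hz, 50 cycles last 500 ms, which is
about four times the fixed 128 ms interval; the two durations only
become comparable for fundamental frequencies near 400 Hz.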


except between PPQ and APQ, PPQ and %JIT, SFR and PA (Table 3.15), and
between SFR and HNR, and PA and HNR (Table 3.16).


The correlation between SFR and PA, also noted by Davis (1976) and
Prosek et al. (1987), can be explained as follows. A decrease in the PA
indicates more noise and less periodicity in the residue signal, which
means more noise and less harmonic structure in the residue spectrum,
and thus an increase in the SFR. This can also explain the correlation
[...] of the speech signal (i.e. in HNR) will have repercussions on the
speech and residue spectra. If the amplitudes of the higher harmonics of
the signal decrease, the open quotient will increase, which is reflected
by a decrease in HNR and an increase in SFR [...] Table 3.12). It is
interesting to note that, for the HNR, the results of this correlation
analysis are very different from those obtained using the formant
synthesizer (see Section 2.3), since no significant correlation between
HNR and the perturbation measures was observed.


which ones yield different results for high-quality and very deviant
voices; this was done for the two conditions of analysis presented
above, and the results are presented in Tables 3.16 and 3.17. Both
Table 3.16: Difference between the means of various
measures for normal and pathological data
using a Student t-test (no filter update)


t-value  a,«0  9.693  0.186  7,870  2.534  12.570  4.196  5.607  1.622 


%   level of significance
NS  not significant


Table 3.17: Difference between the means of various
measures for normal and pathological data
using a Student t-test (filter updated every frame)

t-value         6.829  4.602  0.840  7.175  1.851  11.229  3.735  6.251  1.770

Interpretation  99.9%  99.9%  NS     99.9%  NS     99.9%   99.9%  99.9%  NS

%   level of significance
NS  not significant
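
The t-values reported in these tables compare the mean of one measure in
two groups of voices. A minimal sketch of such a comparison is given
below, assuming the classical pooled-variance form of the Student
t-test; the original computation is not shown in the text, and the
function name is hypothetical.

```python
import math


def student_t(sample_a, sample_b):
    """Two-sample Student t statistic with pooled variance.

    Returns (t, degrees_of_freedom). Equal variances are assumed.
    """
    na, nb = len(sample_a), len(sample_b)
    if na < 2 or nb < 2:
        raise ValueError("each sample needs at least two values")
    mean_a = sum(sample_a) / na
    mean_b = sum(sample_b) / nb
    ss_a = sum((x - mean_a) ** 2 for x in sample_a)
    ss_b = sum((x - mean_b) ** 2 for x in sample_b)
    df = na + nb - 2
    pooled_var = (ss_a + ss_b) / df
    if pooled_var == 0:
        raise ValueError("both samples are constant")
    std_err = math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    return (mean_a - mean_b) / std_err, df
```

The absolute value of t is then compared with the critical value for the
chosen level of significance and the returned number of degrees of
freedom.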


Table 3.18: Statistics of the 6 most efficient measures
applied to the reduced data base and obtained [...]

1 high-quality voices
2 very deviant voices

[...] SFR, PA, HNR, and Percent Jitter). The three other measures (SFF,
EX, DPF) show results that are not significant, or less significant than
the [...] six most efficient measures, as well as their means and
standard deviations.


Therefore, at this stage of the study, we [...] uncorrelated measures of
voice quality, and [...] were efficient in discriminating between
high-quality and deviant [...]


3.7 Testing of Measures using a Formant Synthesizer

At this stage of the research, we examined two factors of the
formant synthesizer:

- What are the effects of the different excitation sources on the
various measures when the latter are applied to a synthetic voice?

- It is known (Childers and Wu, 1988) that the second formant plays
a particularly important role in the perception of synthetic speech, but
no quantitative measurements have been provided to substantiate this

3.7.1 Effects of the Various Excitation Sources

[...] it using our formant synthesizer. To do so, we tracked the
fundamental frequency and the first three formants every frame over 100
frames.

We then synthesized the voice using two types of excitation:

1 - LF model, which can be varied using three parameters (Figure 3.4):
    Tp - Time span for glottal flow to go from zero to maximum
         value (% of pitch period)
    Te - Time from glottal flow initiation until instant at which
         slope of flow attains maximum negative value in cycle (%
         of pitch period)
    Ta - Time constant of residual decay (% of pitch period)

2 - Fant Circuit Model with Glottal Area Parameters:
    OQ - Open Quotient defined as

         OQ = (Tclose - Topen) / (pitch period)                (3.9)

    SQ - Speed Quotient defined as

         SQ = (duration of positive rate of flow) /
              (duration of negative rate of flow)              (3.10)

    SI - Slope Factor (x10)

[...] for modeling the glottal volume velocity for breathy voices.

The various results are summarized in Table 3.19, and the
corresponding excitation waveshapes are shown in Figures 3.5 and 3.6.
As can be seen, varying the Open Quotient in the Fant Model did not
affect significantly [...] the Open and Speed Quotient. On the other
[...] Model did affect the results. The case corresponding to the
largest open phase (Tp=63, Te=[...], Ta=12) shows the [...] but also the
best HNR, which seems to [...] previous correlation analysis. As
expected, the "best" overall measures are obtained for the excitation
corresponding to the shortest open phase, and a medium residual decay.
These results suggest that it is easy to synthesize breathy voices, and
that such voices are characterized by high perturbation measures, and
also relatively large values [...]


3.7.2 Effects of Formant Variations

It has been noted by several researchers that the second formant
seems to be particularly important for the perception of synthetic
speech (Childers and Wu, 1988). To validate this observation, we [...]
first three formant tracks, keeping the two others constant. The
fundamental frequency was set equal to 130 Hz. The results are
summarized in Figures 3.7 to 3.10. Figure 3.7 shows the variations of
[...] to one of the formant tracks, whereas Figures 3.8 to 3.10 show the
corresponding degradation of the speech signal.


The amount of variation [...]

Figure 3.5: LF model glottal excitation using various parameters

Figure 3.6: Fant model glottal excitation using various parameters
[...] noise added to [...] F1=310 Hz and F2=2030 Hz will [...] a formant
[...] produce the [...]

(3.11)

[...] noise with [...] (amount of variation).


second formant will produce the largest variations for PPQ and HNR.
However, the largest average perturbation we get when adding a noise of
20% variance to the second formant is only about 1.7 samples. We also
observe that changes in the first formant do not affect the PPQ. On the
other hand, variations in all three formants affect the HNR, as well as
the Spectral Flatness of the residue signal. This latter result is not
shown on Figure 3.7 but was expected, since the SFR measures the masking
of the fundamental frequency harmonic amplitudes by noise, which is
related to changes in the formant behavior.


This chapter has presented the results of the various acoustic
measures introduced in Chapter 2 when applied to our data base. Several
points have been studied, including the robustness of the pitch [...]
the results, as well as the influence of variations in the excitation
source and in the vocal tract. The conclusion is that six of the 12
measures introduced in Chapter 2 seem adequate to study voice quality.
These six measures are the ones which showed statistically different
results for high-quality and very deviant voices. However, in order to
really test these measures, they now need to be applied to the entire
data base, and correlated to the results of a subjective listening test.


CHAPTER 4

CORRELATES OF VOICE QUALITY


To date, the most reliable way to evaluate the quality of a speaker
remains the subjective evaluation by a listener. This method is
subjective by definition. Whether we wish to diagnose a pathology, or
simply to evaluate the quality of a synthesized voice, an objective
[...] us better than a subjective evaluation. Finding such measures has
been the objective of many researchers in the past 25 years, and a great
number of these measures have been described in the previous chapters.
As noted previously, six of them have been retained for analysis.

At this stage, the only method to evaluate the capacity of these
measures to predict voice quality is to correlate them to the results of
a listening test. Such a test was therefore designed and carried out by
seven judges. The results of this test, and their correlation to the
acoustic measures, are presented in this chapter.


4.1 Presentation of the Listening Test

4.1.1 Material and Method

The 52 normal voices and 30 pathologic voices (including [...])
were part of this test. Each speaker phonated the vowel /i/ during
approximately 2 s. The speech samples were presented through a DSC A/D
[...] preceded by a tone generated by a sinewave and was repeated twice.

The 30 pathologic voices were evaluated on five different scales,
each scale going from 1 (mildly deviant) to 7 (very deviant). These
five scales were overall severity, hoarseness, breathiness, roughness,
and vocal fry. Hence, the 30 voices were presented five times, but in
randomized orders. These 5 scales can be compared to the "GRBAS" scale
widely used by voice specialists in Japan (Imaizumi, 1986), which
consists of the five following scales: grade of hoarseness (G),
roughness (R), breathiness (B), asthenicity (A), and strained quality
(S). A relatively new test, the Diagnostic Acceptability Measure
(Quackenbush, 1986), also uses various scales of voice quality. Several
similar methods have been used in the past: Prosek et al. (1987) used
the five following scales: harshness, hoarseness, breathiness,
strained-strangle, and adequacy, whereas Kreul and Hecker (1971) used
hoarseness, harshness, and breathiness. In another study, Stoicheff et
al. (1983) asked the listeners to indicate the predominant quality of
voice among roughness, breathiness, hoarseness, strained, and normal.
Most of the studies surveyed used either a 4-point scale (Moran and
Gilbert, 1984), a 5-point scale (Sapozhkov, 1972; Kuwabara and Ohgushi,
1984), a 7-point scale (Prosek et al., 1987), a 9-point scale (Kreul and
Hecker, 1971), or a 10-point scale (Rothauser et al., 1971). In our
case, due to the relatively wide range of voice qualities present in the
data base, a 7-point scale seemed appropriate.

The normal voices were judged on their overall excellence, on a
scale from 1 to 7 (1: poor, 7: excellent). The judges were also asked to
recognize the gender of the speaker; gender recognition was also part of
[...] judging [...] severity).

The panel of listeners consisted of seven judges (four males and
[...]) [...] and all familiar with the various voice qualities described
above. [...] signal representative of the quality judged was presented,
[...] an example of the scoring [...]


4.1.2 Reliability

[...] results to a set of acoustic measures. However, in order to do
so, one must take the mean of the ratings given by all judges to a given
voice. [...] sufficiently high (this is the inter-judge reliability).
Similarly, the intra-judge reliability must be sufficiently high for
each judge, so that his or her ratings can be safely included in the
mean.

For each listening task, a certain number of samples already rated
[...] without the judge's knowledge. A +/-2 scale degree was adopted as
an acceptable error for these repeat judgments. If a judge did not
achieve a degree of reliability of at least [...]% based on this
criterion for a given task, his or her ratings were not included in the
mean. This only


[...] reliability was good, given the difficulty [...]

[...] inter-judge reliability, [...] Kendall coefficient of concordance
[...] statistical significance was assessed by applying a [...] test.
The results are presented in Table 4.1. As can be seen, the concordance
is [...] the overall severity, and relatively good for the other tasks
related to pathologic voices ([...] then before the tasks), [...] the
ratings of normal voices. This suggests [...] had different sets of
values concerning the normal voices. This [...] evaluating voice quality
with objective measures, when human evaluation in itself varies from
person to person. This is especially revealing in [...] judging voices;
we can hence foresee what the concordance would be between
non-professional judges.
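
The Kendall coefficient of concordance used for the inter-judge
reliability can be sketched as follows. This is a minimal version
without the correction for tied ranks, and the chi-square approximation
returned with it is an assumption about the significance test, since the
name of the test applied is not legible here. Function names are
hypothetical.

```python
def average_ranks(scores):
    """Rank scores from 1 to n, giving tied scores their mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kendall_w(ratings):
    """Kendall coefficient of concordance W (no tie correction).

    ratings is a list with one list of scores per judge, all judges
    rating the same n voices in the same order.
    Returns (W, chi_square) with chi_square = m * (n - 1) * W,
    to be compared with a chi-square table with n - 1 degrees of freedom.
    """
    m = len(ratings)
    n = len(ratings[0])
    if m < 2 or n < 2 or any(len(r) != n for r in ratings):
        raise ValueError("need >= 2 judges rating the same >= 2 voices")
    ranks = [average_ranks(r) for r in ratings]
    rank_sums = [sum(r[j] for r in ranks) for j in range(n)]
    mean_sum = m * (n + 1) / 2.0
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    w = 12.0 * s / (m ** 2 * (n ** 3 - n))
    return w, m * (n - 1) * w
```

W ranges from 0 (no agreement among judges) to 1 (identical rankings),
so a low value for a given task indicates that the mean rating over
judges would not be meaningful.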

Given the inter-judge reliability for the various tasks, it was
decided that taking the means of the ratings for the first five tasks
(i.e. for the pathologic voices) would make sense, but that it would not
be possible to do the same thing for normal voices. In the latter case,
the correlations between acoustic measures and quality ratings would
have to be assessed judge by judge. This meant that no general
conclusions could be derived for the objective evaluation of the quality
4.2 Relations Between [...]

[...] studies, Kreul and Hecker (1971), Stoicheff et al. (1983), and
Prosek et al. (1987) reported that the ratings of listeners on voice
quality scales were highly correlated such that the [...] breathiness,
and roughness scales were indistinguishable. To verify this fact, the
correlation matrix for the five scales ([...] breathiness, roughness,
and vocal fry) was computed, and is presented in Table 4.2. [...] the
overall severity is very [...] fry. It also appears that hoarseness
[...] breathiness, roughness, and vocal fry, and that roughness and
vocal fry are also strongly related.

In order to analyze these correlations, it is necessary to go back
to the definitions of these various laryngeal qualities:

- Breathiness: audible escape of air through the glottis due to
insufficient glottal closure; the degree of breathiness severity is
[...] vibration.

- Roughness: low-pitched noise, presumably due to irregular vocal
fold vibrations.

- Hoarseness: defined as breathy plus rough; it is therefore a
result of the combination of excessive air escape and aperiodicity of
vocal fold vibration.

- Vocal fry: characterized by a rapid series of taps and is [...] a
rough sounding phonation.

From those definitions, it is not surprising that hoarseness should
be highly correlated both to breathiness and roughness (since it is a
combination of these two qualities), and that roughness and vocal fry
should be barely distinguishable (since they are both characterized by
low-pitched noise).


In order to complete this analysis, the correlations of all the
ratings of the five voice describing parameters over the 30 pathologic
voices were tabulated in a correlation matrix, and a "Principal Factor
Analysis" (Harman, 1960) was used. The principal concern of factor
analysis is the linear resolution of a set of variables in terms of a
small number of factors (i.e. of categories). This can be achieved by
the analysis of the correlations among the variables.

This analysis produced two factors that together account for
102.29% of the common variance. The overall severity had the largest
loading on the first factor, followed by hoarseness and roughness.
Breathiness had the smallest loading; on the contrary, breathiness had
the largest loading on the second factor, and vocal fry had a rather
large negative loading on this same factor. However, one must be
cautious in interpreting these results, since it is usually admitted
that there should be at least three variables per factor. The small
number of variables used in this study (five) does not permit us to meet
this criterion.

Finally, the multiple linear correlation coefficient obtained using
the ratings of hoarseness, breathiness, roughness, and vocal fry as
predictors and the rating of overall severity as criterion was computed.
The value was R^2=0.929, with an F-value of 82.03 and 29 degrees of
freedom (29df). However, only the coefficients of hoarseness and vocal
fry are statistically significant. Indeed, by computing the multiple
[...]

Hence, with a mean square error of 0.25, the rating of overall
severity can be safely derived from the ratings of hoarseness and vocal
fry using the equation

overall severity = 0.236 + 0.632*hoarseness + 0.395*vocal fry    (4.1)
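
As a worked illustration, and assuming that Eq. (4.1) reads
overall severity = 0.236 + 0.632*hoarseness + 0.395*vocal fry, the
prediction can be evaluated for any pair of ratings on the 1 to 7
scales. The function name is hypothetical.

```python
def overall_severity(hoarseness, vocal_fry):
    """Overall severity predicted from two ratings, following Eq. (4.1).

    Both inputs are mean ratings on the 1 to 7 scales of the listening
    test; the additive signs are the assumed reading of the equation.
    """
    return 0.236 + 0.632 * hoarseness + 0.395 * vocal_fry
```

For instance, a voice rated 5 for hoarseness and 3 for vocal fry would
be predicted at about 4.6 for overall severity.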


From this analysis, it can be concluded that the overall severity
of a voice is dependent on the amount of air escape, aperiodicity of
the vocal fold vibration, and the presence of low frequency phonation.


4.3 Results of Multiple Regression

The six acoustic measures retained for analysis were applied to the
52 normal voices and 30 pathologic voices. At this point, as indicated
earlier, two strategies were used: a general analysis was done for
pathologic voices using the means of the ratings for each task, whereas
the results of the listening test for normal voices were analyzed judge
by judge, due to the poor inter-judge reliability.

[...] stepwise regression procedure [...] used with the acoustic
parameters as predictors and each voice quality scale in turn as
criterion. In this procedure, the variables already in the equation are
[...] intercorrelations [...] not be important at a later one. The list
of acoustic measures used in these regressions is presented in Table
4.3. The speaker gender was added to this list; it was introduced in
the regression as a dummy [...] male. The stepwise regression procedure
is summarized in Figure 4.1. It must be noted that in the remainder of
this chapter, PPQ and APQ are not expressed in percent, but in relative
value (i.e. in percent divided by 100).

4.3.1 Pathologic Voices

For the 30 pathologic voices, five separate analyses were made, one
for each rating task (i.e. overall severity, hoarseness, breathiness,
roughness, and vocal fry). The correlations between the various
acoustic parameters and the five rating scales are presented in Table
[...] (PPQ) appears to be the predominant factor, the Pitch Amplitude
(PA) is [...] qualities. This result is in agreement with the results
of the study of Prosek et al. (1987). The next most important factor is
the Harmonics-to-Noise Ratio (HNR); surprisingly, however, the latter
does not have the highest coefficient of correlation with the rating of
hoarseness, as was expected from previous studies (Yanagihara, 1967).
Finally, it [...] degrees of roughness and of vocal fry. In fact,
except for these last [...] correlations and [...] Amplitude
Perturbation Quotient (APQ) and the rating of breathiness, all the
correlations presented here are statistically significant at the
α = 0.05 level.

[...] correlated to more than just one acoustic parameter, the following
procedure was adopted for each task: a stepwise linear regression was
[...] Then, if only one or two parameters were retained, a complete
second-order model [...]

Y = B0 + B1*X1 + B2*X1^2                                          (4.2)

and with two parameters (X1 and X2), the equation is given by

Y = B0 + B1*X1 + B2*X2 + B3*X1*X2 + B4*X1^2 + B5*X2^2             (4.3)

The statistical significance of this second-order model compared to
the first-order model as given by the stepwise regression (i.e.
Y=B0+B1*X1 with one parameter or Y=B0+B1*X1+B2*X2 with two parameters)
was then tested.
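
The comparison of a larger model with the smaller model nested in it is
commonly done with a partial F-test on the two R^2 values. The sketch
below assumes this standard form; the numbers in the usage note are
hypothetical and are not results of this study, and the function name
is an illustration only.

```python
def partial_f(r2_reduced, r2_full, n_obs, k_reduced, k_full):
    """Partial F statistic for nested linear regression models.

    r2_reduced, r2_full: R^2 of the smaller and of the larger model
    n_obs: number of observations (voices)
    k_reduced, k_full: number of predictors, intercept excluded
    Returns (F, (df_numerator, df_denominator)).
    """
    if not k_full > k_reduced >= 0:
        raise ValueError("the full model must contain extra predictors")
    df_num = k_full - k_reduced
    df_den = n_obs - k_full - 1
    if df_den <= 0:
        raise ValueError("not enough observations for the full model")
    if not 0.0 <= r2_reduced <= r2_full < 1.0:
        raise ValueError("expected 0 <= r2_reduced <= r2_full < 1")
    f_value = ((r2_full - r2_reduced) / df_num) / ((1.0 - r2_full) / df_den)
    return f_value, (df_num, df_den)
```

With 30 voices, a hypothetical two-parameter first-order model with
R^2 = 0.60 and the corresponding five-term second-order model with
R^2 = 0.65 give F of about 1.14 with (3, 24) degrees of freedom, which
would not justify the additional terms.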

Likewise, a linear multiple regression using all the acoustic
parameters as predictors and the quality scale as criterion was run,
and its significance tested. It must be clear that adding more terms in
the regression will undoubtedly increase the R^2 coefficient, but it
remains to be seen whether this increase is statistically significant,
and not just the result of adding more terms.

Finally, the best model was adopted, and the acoustic correlates of
each voice quality, along with the R^2 of the corresponding model, are
summarized in Table 4.6.

Table 4.6 (columns: overall severity, hoarseness, breathiness,
roughness, vocal fry; legible entries: PA, PPQ, Gen; R^2=0.63; R^2=0.54)

Overall severity. The stepwise linear regression retained two
variables: Pitch Amplitude (PA) and Harmonics-to-Noise Ratio (HNR), with
the contribution of the PA being the more important by far (partial R^2
of 0.585 for the PA, compared to a partial R^2 of 0.032 for the HNR).
The regression equation is then given by

overall severity = 6.553 - 4.736*PA - 0.060*HNR                   (4.4)

The R^2 of this model has a value of 0.617, with an F-value of 21.74
and 29df, which means that the degree of significance α is 0.0001.


[...] statistically significant, nor did the complete first-order model
using all the acoustic parameters (in that case, the R^2 was equal to
0.659, but this slight increase was mainly due to the fact that more
variables were added to the equation, and nothing else). Hence, Eq.
(4.4) seemed to reflect the best model to predict overall severity. For
example, a [...] expected to obtain a 5 on a scale of 1 to 7, which
means a very deviant [...]


Hoarseness. In this case, the stepwise linear regression retained
three variables: PA, PPQ, and gender. Once again, the PA was by far the
most important factor, with a partial R² of 0.545, compared to a partial
R² of 0.045 for PPQ and 0.037 for the gender. The regression equation
is

hoarseness = 5.256 - 3.968*PA - 17.137*PPQ - 0.646*gender  (4.5)

The R² of this model was equal to 0.627, with an F-value of 14.54 and a
degree of significance of 0.0001.

Various models were built using the three variables retained by the
stepwise regression, as well as the other acoustic parameters; none of
these models appeared to bring a statistically significant improvement
with respect to the model of Eq. (4.5).

[...] can be expected to be given a rating of 5 for hoarseness on a scale of 1
[...]


Breathiness. This case was the most difficult to analyse. The
stepwise linear regression only retained two parameters: PPQ and %JIT,
with an R² value of 0.411. The fact that the main parameters to predict
the degree of breathiness are pitch perturbation related quantities is
not surprising, given the very definition of breathiness. However, the
low value of R² suggests that other parameters contribute to the
perceptual impression of breathiness as well.

Hence, several other models were built to identify some of these
parameters that were not retained by the stepwise regression because of
a low F-value. First, a linear regression using all the acoustic
parameters as predictors was run; an R² of 0.629 was achieved that way,
and, using a partial F-test, it appeared to be statistically
significant. However, using a t-test, some of the parameters in the
regression equation were found to be non-significant. Hence, a new
linear regression was run using only the significant parameters, i.e.
[...]

breathiness = ?.538 + 14.876*PPQ + 0.111*%JIT - ?.400*PA + 0.147*HNR - 0.950*gender  (4.6)

The R² for this model was [...] level of significance of [...]

The improvement with respect to the [...] to be statistically
significant.


Roughness. For this quality, the stepwise linear regression retained
the four following parameters: PA, gender, HNR, PPQ. A linear regression
using these four parameters as well as the APQ, the Spectral Flatness of
the Residue (SFR) and the Percent Jitter (%JIT) was also run, but did
not bring any statistically significant improvement over the stepwise
regression. The equation given by the latter is

roughness = 5.950 - 3.557*PA - 0.928*gender - 12.180*PPQ - 0.067*HNR  (4.7)

[...] level of significance [...] 0.0001.


Vocal fry. As for the rating of overall severity, the two parameters
retained by the stepwise regression analysis were the PA and the HNR. A
complete [...]-order model using these two parameters, as well as a
first-order model using all the acoustic parameters, were then built,
but did not improve significantly over the model derived from the [...]

vocal fry = 5.286 - 3.380*PA - 0.064*HNR  (4.8)

The R² of this model was equal to 0.540, with an F-value of 15.82 and a
level of significance of 0.0001. The most important contribution was
made by the PA, with a partial R² of 0.470. A patient's voice with a PA
of 0.3 and an HNR of 8.66 dB is likely to obtain a rating of 3 on a 1 to
7 scale of vocal fry.
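To make the use of the regression models concrete, the following minimal sketch evaluates Eqs. (4.4) and (4.8) as printed. The function names, the clipping helper, and the input values are hypothetical and chosen only for illustration; the models with a gender term are left out because the numerical coding of gender is not given in this section.

```python
def predict_overall_severity(pa, hnr_db):
    """Eq. (4.4): predicted overall severity rating."""
    return 6.553 - 4.736 * pa - 0.060 * hnr_db


def predict_vocal_fry(pa, hnr_db):
    """Eq. (4.8): predicted vocal fry rating."""
    return 5.286 - 3.380 * pa - 0.064 * hnr_db


def clip_to_scale(rating, low=1.0, high=7.0):
    """Keep a prediction inside the 1-to-7 rating scale (added convenience)."""
    return max(low, min(high, rating))


# Hypothetical measurements, not taken from the data base.
pa, hnr_db = 0.5, 10.0
severity = clip_to_scale(predict_overall_severity(pa, hnr_db))
vocal_fry = clip_to_scale(predict_vocal_fry(pa, hnr_db))
print(severity, vocal_fry)
```

Both models have negative coefficients for PA and HNR, so a lower pitch amplitude or a lower harmonics-to-noise ratio pushes the predicted rating toward the deviant end of the scale.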


As mentioned earlier, the inter-judge reliability was so low for
normal voices that it did not permit taking the means of the ratings
across all judges. The only rating required in that case was the rating
of overall excellence of the voice judged.

It was therefore decided to run some stepwise regressions using the
various acoustic parameters (plus the gender) as predictors and the
ratings given by each judge in turn as criterion. The results are
presented in Table 4.6.


Table 4.8: Means and correlation matrix of acoustic measures

As can be seen, the important parameters vary from judge to judge,
and no definite trend can be derived from these results; moreover, the
R² of the various models are low, ranging from 0.16 to 0.42. These
results suggest that each judge relied on different parameters, and that
[...]

It must be noted, however, that the Spectral Flatness of the
Residue (SFR) was "used" by three of the judges, whereas it was not part
of any of the models described in Section 4.3.1, and that the Pitch
Amplitude (PA) seemed once again to be a predominant factor, even though
its influence was not as important as with pathologic voices.

4.4  Discussion 

As described in the previous section, the results of the regression
analyses performed for the various voice qualities emphasised several
[...]
important one is the Pitch Amplitude (PA), which is the predominant
factor for all the voice qualities except breathiness (see Table 4.5),
as well as for the rating of normal voices (see Table 4.6). The second
most important factor appears to be the Harmonics-to-Noise Ratio (HNR),
which is correlated to the judges' decision for rating of overall
[...]
to be very important for the ratings of normal voices. The Spectral
Flatness of the Residue (SFR) and the Amplitude Perturbation Quotient
(APQ) do not play any role in the judges' evaluation of voice quality
for pathologic speakers, but SFR is taken into account by three of the
seven judges for the rating of normal voices. Hence, we can conclude
that APQ is the least useful factor in predicting voice quality, whether
this voice is normal or pathologic. Of the three remaining parameters,
i.e. Pitch Perturbation Quotient (PPQ), percent jitter (%JIT), and the
voice gender, PPQ appears to be the most important one for the
prediction of quality of pathologic voices, whereas %JIT is the most
important for normal voices (however, both these factors play a major
role in the prediction of breathiness). The speaker's gender does not
seem to play an important role for the prediction of quality (i.e. in
most cases, there is no statistical difference in predicting voice
quality for males or females).

Tables 4.7 and 4.8 present the correlations between the various
measures both for normal and pathologic voices. Five strong
correlations are common to these two categories: PPQ-APQ, PPQ-%JIT,
APQ-%JIT, PA-HNR and SFR-PA. The first three had been noted using
synthetic voices (Section 2.4) and the last two were noticed using the
reduced data base (Section 3.6.3). But it appears that, in general, the
measures are much more inter-correlated for pathologic voices than for
[...]
group than normal voices, which was also reflected by the good
inter-judge reliability for the various tasks. The two most strongly
correlated measures are SFR and PA (R² = 0.?7), which means that SFR does
not bring any new information with respect to PA, and explains why it
does not appear in any of the models built for pathologic voices.
Similarly, APQ is strongly correlated with all the other measures, which
explains the fact that it does not play any role in the regression
[...]


The results obtained contradict previous studies concerning the
HNR. It is interesting to note that the HNR does not play any role in
the prediction of hoarseness, contrary to what other studies suggested
(Yumoto et al., 1982; Yumoto et al., 1984). As a result of their
research, Yumoto et al. (1982) concluded that pathologic voices had an
HNR smaller than 7.4 dB. Our results do not support this conclusion: 10%
of the pathologic voices analysed had an HNR larger than 7.4 dB, and 19%
of the normal voices had an HNR smaller than 7.4 dB. Also, contrary to
the findings of Yumoto et al. (1982), female voices had a larger HNR
than male voices, and this difference was statistically significant (see
Section 3.3).

The model obtained for the prediction of breathiness (Eq. (4.6))
suggests that an increase in the HNR corresponds to a perceptual
deterioration of the level of breathiness of the voice judged, while,
from Eq. (4.4) for the prediction of overall severity, it is seen that an
increase of the HNR corresponds to an improvement of the perceptual
impression of overall severity (i.e. if the HNR increases, the voice
quality is perceived as better on an overall quality scale). In other
words, if two voices of the same gender having the same values of PPQ
and PA were to be compared, the voice having the highest HNR is likely to
be judged as the most breathy, but also as the best quality one. This
confirms the results obtained using the formant synthesiser (see Section
3.7), when we found that breathy voices were associated with high
perturbation measures, but also high Harmonics-to-Noise Ratios.

[...] of roughness, Imaizumi (1986) concluded that the
acoustic correlates of roughness include both the multiplicative
variations which occur over several pitch periods, but also those which
are synchronous with the vocal pitch period; he also noted that
irregularities in the speech waveform are not necessarily essential for
perceptual roughness. Our results suggest that pitch perturbation
measures are indeed an important correlate of roughness (since PPQ is
present in Eq. (4.7)), but the most important correlate appears once
again to be the PA, with a partial R² of 0.569.


The primary goal of this part of the study was to find the acoustic
correlates of voices covering a wide range of the voice quality
continuum. The results were mixed. The perceptual evaluation of
pathologic voices on five different scales yielded some interesting
results, readily interpretable. On the contrary, the perceptual
evaluation of normal voices did not appear to be very positive, due to
the fact that the judges all relied on different criteria, most of which
were not part of the set of acoustic measures studied here.

It is likely that a vocal pathology can affect more than one
dimension of the voice simultaneously, which would explain the high
correlations between the various quality ratings, as shown in Table 4.2.
This makes the task of estimating specific attributes of the voice using
acoustic parameters much harder. In order to achieve this task
satisfactorily, more information concerning the effects of specific
pathologies on vocal fold vibration is needed.

However, the results obtained in this study for pathologic voices
were good enough to be able to derive models to predict the various
voice qualities; it is clear that these models do not predict perfectly
the degree of a given quality, but the R² were high enough to affirm
[...]
weaknesses of such a study resides in the fact that the acoustic
measures used as predictors of quality represent only a fraction of the
set of all the measures used by a human listener. However, the
relatively high values of R² suggest that the set of acoustic measures
chosen for this study was adequate. The values of the R² obtained for
the various regressions were slightly inferior to those obtained by
Prosek et al. (1987) in a similar study, but this was mainly due to the
fact that fewer acoustic parameters were used as predictors, in order to
retain only those parameters that were statistically important. As
mentioned earlier, adding more predictors in a regression analysis will
always improve the value of the R², but this may not have any
statistical value.

The main conclusion of this study was that the best predictor of
all the qualities studied, except breathiness, was the Pitch Amplitude;
this result is consistent with the findings of Prosek et al. (1987).
Such a result might suggest that a sort of "autocorrelation process"
takes place in the human auditory system. We can summarise the results
concerning pathologic voices as follows:

overall quality: its degradation is characterised by a low PA and a
low HNR.

hoarseness: a hoarse voice exhibits a low PA, a high PPQ, and is
[...] judged according [...]

breathiness: a breathy voice is characterized by high pitch
perturbation measures (PPQ and %JIT), a low PA, and a high HNR; the
gender is also a significant factor.

roughness: a rough voice is characterized by a low PA, a low HNR,
and a high PPQ; the gender also plays a role.

vocal fry: characterized by a low PA and a low HNR.

By taking the means of all ratings, we erased the differences of
evaluation between the various listeners. The values of the
coefficients of concordance for the various tasks, far from being
perfect, showed that such differences exist; however, these values were
still high enough to justify the fact of taking the means of the various
ratings. However, for the sake of completeness, stepwise linear
regressions were run for each judge with the acoustic measures (plus
gender) as predictors and the various rating tasks (concerning
pathologic voices) as criterion. The results are presented in
Table 4.9. As expected, results vary from judge to judge, but the
R² of the various models are generally much higher than those obtained
for normal voices (see Table 4.6), and some trends can in general be
observed for each voice quality. This suggests that taking the means of
the ratings was a justified method.

As mentioned at the beginning of this chapter, the judges were also
instructed to recognize the gender of the 30 pathologic voices and of
the 52 normal voices. Among the normal voices, only one male voice was
incorrectly identified by three of the judges, and the quality of this
voice was judged to be rather poor. This male voice was characterized
by a rather large Percent Jitter, but not the highest among normal
[...] for pathologic voices: three subjects were [...]
identified by at least six of the judges, and all three were male
voices. Two of these subjects were judged to have a very deviant voice,
whereas the third one had a mildly deviant voice. No trend among the
[...] easily identifiable.

Finally, in order to compare the amount of information carried by
vowels as compared to sentences, a second listening test was
implemented. This test had the same format as the test described at the
beginning of this chapter, but [...]
were away a year ago." Of course, no acoustic measures were available
[...]
as a voiced/unvoiced/silence detection, which is in contradiction with
the need for simple measures), but it seemed interesting to compare its
results to the vowel test results, since both of them concerned the same
normal and pathologic data base. Six of the previous seven judges were
the listeners for this test. A coefficient of correlation between the
two tests was computed for each task and each judge, and the results are
presented in Table 4.10. The correlation was quite poor for the
judgment of normal voices, thus suggesting that listeners rely on
different sets of criteria to judge a voice depending on what they
listen to (vowels or sentences). As for pathologic voices, the
correlations were higher than for normal voices, but appeared to be
listener-dependent. However, the judgments of overall quality and of
roughness were quite consistent [...] between the [...]

* significant at the 99% level


The results of this last study seem to indicate that, in general,
the nature of the utterance used to judge the quality of a voice
greatly influences the results of a test. Since a sentence carries more
information than a vowel, the set of criteria on which a listener bases
a decision is probably wider for sentences than for vowels, which can
explain the difference of results between the two tests. It is also
worth noting that the normal male voice incorrectly identified by three
judges with the vowel listening test was correctly identified when
sentences were used. This was also the case for two of the three male
pathologic voices that had been previously mistaken for female voices.

The study described in this chapter identified the acoustic
parameters that best predict the quality of a given voice. This is what
[...]
as coders or synthesisers, it is often necessary to judge the quality of
a voice with respect to another voice (or reference), which can be
called "relative judgment." In that case, distance measures need to be
implemented, and this is the object of the next chapter.


DISTORTION MEASURES


The need for reliable distance measures is one of the most
challenging problems facing speech scientists. Whether it is to
evaluate synthesizers or coders, distance measures that can reproduce
the judgments of human listeners are critically needed. This is a
difficult task, however, since the perception of speech is a highly
complex process. In the fields of speech or speaker recognition, too,
there is a critical need for distance measures that can effectively
[...]
the literature are described, and their efficiency when applied to our
data base is detailed; their intercorrelations are also studied. Then,
using the results of the previous chapters, we describe a new Euclidean
distance measure based on various acoustic parameters, and study its
efficiency.

Besides testing the various distance measures on our human data
base, we also use the formant synthesiser described in the previous
chapters. This allows us to study the reactions of the various
distances when prescribed distortions are applied, and hence to verify
our various hypotheses.

[...] and "distortion measure" will be used interchangeably, even though
[...] V'(θ) [...]

The difference between the two spectra on a log magnitude versus
frequency scale is defined by

$$D(\theta) = 10\log_{10}|V(\theta)|^2 - 10\log_{10}|V'(\theta)|^2 \qquad (5.3)$$

[...] the spectral models is the set of [...]

$$(d_p)^p = \int_{-\pi}^{\pi} |D(\theta)|^p \,\frac{d\theta}{2\pi} \qquad (5.4)$$

For p=2, the rms log spectral measure is defined. Figure 5.1
presents an example of two log spectra, along with their difference and
rms log spectral distance.
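As an illustration of the log spectral distance, the sketch below approximates the integral over θ by an average over a uniform frequency grid. It assumes, for the sake of the example only, that each spectral model is an all-pole model gain/A(e^{jθ}); the function names and the coefficient values are hypothetical.

```python
import cmath
import math


def log_spectrum_db(a, gain, n_points=256):
    """10*log10 |gain / A(e^{j*theta})|^2 on a uniform grid over [-pi, pi)."""
    spectrum = []
    for m in range(n_points):
        theta = -math.pi + 2.0 * math.pi * m / n_points
        a_val = sum(a_i * cmath.exp(-1j * theta * i) for i, a_i in enumerate(a))
        spectrum.append(10.0 * math.log10(abs(gain / a_val) ** 2))
    return spectrum


def lp_log_spectral_distance(spec_db, spec_db_prime, p=2):
    """Discrete approximation of the L_p norm of D(theta), in dB."""
    diffs = [abs(x - y) ** p for x, y in zip(spec_db, spec_db_prime)]
    return (sum(diffs) / len(diffs)) ** (1.0 / p)


# Two hypothetical first-order inverse filters [1, a_1].
v = log_spectrum_db([1.0, -0.9], gain=1.0)
v_prime = log_spectrum_db([1.0, -0.7], gain=1.0)
print(lp_log_spectral_distance(v, v_prime, p=2))
```

With p=2 the function returns the rms log spectral distance. A pure gain change shifts the whole log spectrum by a constant, so in that case the distance equals the shift in dB for every p.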

Likelihood ratios. Using LPC analysis, we define the residual
error by

$$e(n) = \sum_{i=0}^{M} a_i\, x(n-i) \qquad (5.5)$$

Then the residual [...]

$$\alpha = \sum_{n} [e(n)]^2 \qquad (5.6)$$

The coefficients {a_i} are chosen to minimize α, which can be considered
to be the output of an inverse filter A(z) where

$$A(z) = \sum_{i=0}^{M} a_i\, z^{-i} \qquad (5.7)$$

[...] Figure 5.2a).


Figure 5.2: Implementation of the likelihood ratio

[...] the likelihood ratio is greater than 1.4, the difference between the two
models becomes perceptible (Gray and Markel, 1976).
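The likelihood ratio can be sketched as follows: the reference signal is passed through the inverse filter of the other model, and the resulting residual energy is divided by the minimum residual energy obtained with the reference's own inverse filter. This is a minimal illustration under stated assumptions, not the implementation of Figure 5.2: the autocorrelation method, the Levinson-Durbin recursion, the synthetic test signals, and all function names are choices made for this example.

```python
import random


def autocorrelation(x, order):
    """Autocorrelation lags r(0)..r(order) of the sequence x."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(order + 1)]


def levinson(r, order):
    """Levinson-Durbin recursion: inverse filter [1, a_1..a_M] and residual energy."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / err
        prev = a[:]
        for i in range(1, m):
            a[i] = prev[i] + k * prev[m - i]
        a[m] = k
        err *= 1.0 - k * k
    return a, err


def residual_energy(a, r):
    """Quadratic form a^T R a, with R the Toeplitz matrix built from r."""
    size = len(a)
    return sum(a[i] * a[j] * r[abs(i - j)] for i in range(size) for j in range(size))


def likelihood_ratio(r_ref, a_ref, a_test):
    """Residual energy of the test filter on the reference signal over the minimum."""
    return residual_energy(a_test, r_ref) / residual_energy(a_ref, r_ref)


def ar_signal(coeffs, n, seed):
    """Synthetic autoregressive signal driven by seeded Gaussian noise."""
    rng = random.Random(seed)
    x = []
    for t in range(n):
        past = sum(c * x[t - 1 - i] for i, c in enumerate(coeffs) if t - 1 - i >= 0)
        x.append(rng.gauss(0.0, 1.0) + past)
    return x


order = 2
x = ar_signal([1.2, -0.6], 400, seed=0)        # hypothetical reference signal
x_prime = ar_signal([0.5, -0.3], 400, seed=1)  # hypothetical distorted signal
r_x = autocorrelation(x, order)
a, alpha = levinson(r_x, order)
a_prime, _ = levinson(autocorrelation(x_prime, order), order)
ratio = likelihood_ratio(r_x, a, a_prime)
print(ratio)
```

Because the reference's own inverse filter minimizes the residual energy, the ratio is never smaller than 1, and it equals 1 only when the two filters coincide. Swapping the roles of the two signals generally gives a different value, which is why the measure is not symmetric.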

Another notation often used in the literature is

$$\frac{a_j^T R_i\, a_j}{a_i^T R_i\, a_i}$$

where, for a given model, R_i is [...]
the undistorted and distorted speech signals, [...]


Let's assume that the speech is generated by a Gaussian process,
the result of white noise passed through an all-pole filter of order K,
and that the analysis interval of length N is much longer than the
all-pole filter order. Then the Itakura-Saito measure (Juang, 1984b) is
defined as [...]

$$\Delta(\omega) = \log|X_n(e^{j\omega})|^2 - \log|X'_n(e^{j\omega})|^2 \qquad (5.16)$$

[...] |X_n(e^{jω})|² is the Fourier transform of x_n(i).

Note that this measure is not symmetric. A distortion sequence [...]
where n is the frame index designating the window location [...]

If X_n(e^{jω}) = σ/A(e^{jω}), where A(z) is the inverse filter polynomial,
then Δ(ω) = D(ω) (see Eqs. (5.3) and (5.16)), and

$$IS = \int_{-\pi}^{\pi} \left[ e^{D(\theta)} - D(\theta) - 1 \right] \frac{d\theta}{2\pi} \qquad (5.19)$$


Klatt distance measure. In various experiments, Carlson and
Granstrom (1979) and Klatt (1982) found that judgments of phonetic
distance differ substantially from judgments of psychophysical distance
in that only formant frequency changes contribute significantly to
phonetic changes, whereas spectral tilt, formant amplitude changes, and
high-pass, low-pass, and notch filtering changes are less important.

Based on this observation, a weighted comparison of spectral slope
was developed (Klatt, 1982). The weighting function takes into account
whether one is at a peak or a valley in the spectrum and whether this is
the biggest peak in the spectrum or not. If dB(i) is the output in
decibels of the i-th channel of a 30-channel critical band spectral
representation of a sustained vowel (this critical band spectrum was
implemented according to the model described by Klatt, 1976), then the
spectral slope at the center frequency of filter i, SL(i), is given by

$$SL(i) = dB(i+1) - dB(i) \qquad (5.20)$$

[...] weighting function is given by

$$W(i) = W_{max}(i)\, W_{locmax}(i) \qquad (5.21)$$

$$W_{max}(i) = \frac{K_{max}}{K_{max} + dB_{max} - dB(i)} \qquad (5.22)$$

$$W_{locmax}(i) = \frac{K_{locmax}}{K_{locmax} + dB_{locmax}(i) - dB(i)} \qquad (5.23)$$

In Eqs. (5.22) and (5.23), dBmax is the maximum output over all
channels, and dBlocmax(i) is the output of the nearest peak to any
channel. It is found by computing SL(i) and then climbing the spectral
peak to the right if SL(i) is positive, or climbing the spectral peak to
the left if SL(i) is negative or zero. Kmax and Klocmax are empirical
constants. Figure 5.3 presents the critical band spectrum for the vowel
/i/, along with the graph of the weighting function when Kmax and
Klocmax are equal to 10.

The Klatt distance between two spectra is then given by

$$d_{12} = K\,|E_1 - E_2| + \sum_{i=1}^{30} W_{12}(i)\,[SL_1(i) - SL_2(i)]^2 \qquad (5.24)$$

where K is an empirical constant, E1 and E2 are the overall energies of
the two spectra, and W12(i) is the average of weighting functions W1(i)
and W2(i) derived separately for the two spectra.
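The weighting procedure described above can be sketched in a few lines. This is an illustration under explicit assumptions rather than the implementation used here: the weight is taken as the product of the two terms, the slope difference is squared, the last channel (which has no right-hand neighbour) is given a slope of zero, and the function names and the five-channel test spectrum are hypothetical.

```python
def slopes(db):
    """SL(i) = dB(i+1) - dB(i); the last channel is ASSUMED to have slope 0."""
    return [db[i + 1] - db[i] for i in range(len(db) - 1)] + [0.0]


def nearest_peak_level(db, i, slope):
    """Climb right if the slope is positive, left otherwise, while the level rises."""
    step = 1 if slope > 0 else -1
    j = i
    while j + step in range(len(db)) and db[j + step] > db[j]:
        j += step
    return db[j]


def weights(db, k_max=10.0, k_locmax=10.0):
    """W(i) = Wmax(i) * Wlocmax(i), following Eqs. (5.21)-(5.23)."""
    db_max = max(db)
    sl = slopes(db)
    result = []
    for i, level in enumerate(db):
        w_max = k_max / (k_max + db_max - level)
        w_loc = k_locmax / (k_locmax + nearest_peak_level(db, i, sl[i]) - level)
        result.append(w_max * w_loc)
    return result


def klatt_distance(db1, db2, k_energy=0.0, e1=0.0, e2=0.0):
    """Weighted slope distance; k_energy = 0 drops the energy term."""
    w1, w2 = weights(db1), weights(db2)
    sl1, sl2 = slopes(db1), slopes(db2)
    total = k_energy * abs(e1 - e2)
    for i in range(len(db1)):
        total += 0.5 * (w1[i] + w2[i]) * (sl1[i] - sl2[i]) ** 2
    return total


# Hypothetical single-peak spectrum (in dB) and a flatter variant.
peaked = [0.0, 5.0, 10.0, 5.0, 0.0]
flatter = [0.0, 2.0, 4.0, 2.0, 0.0]
print(weights(peaked))
print(klatt_distance(peaked, flatter))
```

The weight reaches 1 at the largest peak and falls off in the valleys, so slope differences near spectral peaks dominate the distance. With the energy term switched off, adding a constant number of decibels to one spectrum leaves the distance unchanged, since only slopes are compared.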

Based on the results of previous studies (Barnwell et al., 19?5;
Klatt, 1982), it was decided to take K=0, and hence not consider the
[...] reflecting the absolute energy difference between [...]

Figure 5.3: Critical band spectra and corresponding weighting
function for Klatt distance measure (Kmax = Klocmax = 10)


Root-Power Sums (RPS) distortion. Whereas the Klatt distance
measure as described above is based on differences between slopes of
critical-band spectra, several investigations (Hanson et al., 1986;
Tohkura, 1986) go one step further in applying this concept of slope
difference to all-pole model spectra derived from LPC analysis. Using
Eqs. (5.1) and (5.2), the spectral slope measure is then expressed as

$$d_{\text{spec. slope}} = \int_{-\pi}^{\pi} \left[ \frac{d}{d\theta}\log|V(\theta)|^2 - \frac{d}{d\theta}\log|V'(\theta)|^2 \right]^2 \frac{d\theta}{2\pi} \qquad (5.25)$$

However, this formulation is not a very efficient one. To circumvent
this problem, we use the cepstral coefficients {c_k}, which are defined
as the coefficients of the Taylor series expansion of the polynomial

$$\ln\left[ \sigma^2 / |A(e^{j\theta})|^2 \right] = \sum_{k=-\infty}^{\infty} c_k\, e^{-jk\theta} \qquad (5.26)$$

It can be shown (Gray and Markel, 1976) that the cepstral coefficients
can be computed from the autoregressive coefficients using the formulas

$$c_1 = -a_1 \qquad (5.27)$$

$$-k c_k - k a_k = \sum_{n=1}^{k-1} n\, c_n\, a_{k-n} \quad \text{for } k = 2, 3, \ldots, M \qquad (5.28)$$

$$k c_k = -\sum_{n=1}^{M} (k-n)\, c_{k-n}\, a_n \quad \text{for } k = M+1, M+2, \ldots \qquad (5.29)$$

where M is the order of the inverse filter.
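The recursion from autoregressive to cepstral coefficients can be written compactly. The sketch below is a minimal illustration; the function name and the single-pole test filter are hypothetical, and the gain term c_0 is deliberately left out.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Cepstral coefficients c_1..c_n_ceps of 1/A(z), with a = [1, a_1, ..., a_M]."""
    m = len(a) - 1
    c = [0.0] * (n_ceps + 1)  # c[0] is unused: the gain term is handled separately
    for k in range(1, n_ceps + 1):
        # Sum of (k-n) * c_{k-n} * a_n over n = 1 .. min(k-1, M).
        acc = sum((k - n) * c[k - n] * a[n] for n in range(1, min(k - 1, m) + 1))
        a_k = a[k] if k in range(1, m + 1) else 0.0
        c[k] = -a_k - acc / k
    return c[1:]


# Hypothetical single-pole filter A(z) = 1 - 0.9 z^-1.
ceps = lpc_to_cepstrum([1.0, -0.9], n_ceps=5)
print(ceps)
```

For a single pole at 0.9 the logarithm of 1/A(z) has the known series with coefficients 0.9^k / k, which gives a convenient check of the recursion. Note that the cepstrum does not stop at index M: the second formula keeps generating terms beyond the filter order.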


Then, using Parseval's relation, it can be shown that, as the
number of cepstrum terms becomes infinite, Eq. (5.25) can be computed
as

$$d_{RPS} = \sum_{k=-\infty}^{\infty} k^2\, (c_k - c'_k)^2 \qquad (5.30)$$

Eq. (5.30) is called the Root-Power Sums (RPS) distortion (Hanson and
Wakita, 1986).
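A truncated version of the RPS distortion is easy to compute once the cepstral coefficients are available. The sketch below is illustrative only: it assumes a one-sided sum over k = 1..L (the two-sided sum is twice this value, since the term at k = 0 vanishes and c_k = c_-k), and the function names and coefficient values are hypothetical.

```python
def rps_distortion(c, c_prime):
    """Sum over k >= 1 of k^2 (c_k - c'_k)^2 for cepstral vectors c_1..c_L."""
    pairs = zip(c, c_prime)
    return sum((k * (x - y)) ** 2 for k, (x, y) in enumerate(pairs, start=1))


def cepstral_distortion(c, c_prime):
    """Same sum without the k^2 weighting, for comparison."""
    return sum((x - y) ** 2 for x, y in zip(c, c_prime))


# Hypothetical cepstral vectors c_1, c_2 for two models.
c_ref = [0.9, 0.405]
c_test = [0.7, 0.245]
print(rps_distortion(c_ref, c_test), cepstral_distortion(c_ref, c_test))
```

The index weighting makes differences in higher-order coefficients count more, which is what ties this sum to differences in spectral slope rather than in spectral level.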

Juang et al. (1987) give another interpretation of this sum. They
show that the cepstral coefficients, except the first one, have zero
means and variances essentially inversely proportional to the square of
the coefficient index, i.e.,

$$E\{|c_k|^2\} \sim \frac{1}{k^2} \qquad (5.31)$$

Hence, the k² factor in Eq. (5.30) normalises the contributions from
[...]
5.1.2  Parametric  Distances 

These distance measures refer to geometric distances in domains
where the vocal tract filter has been parameterised in some way
(Barnwell and Bush, 1978). Most of these parameterizations are
associated with the LPC model (e.g. feedback coefficients, PARCOR
coefficients). One of the most common parameterizations comes from the
cepstral coefficients, as described above.

[...]

$$d_p(\xi, \xi') = \left[ \sum_{i=1}^{N} |\xi_i - \xi'_i|^p \right]^{1/p} \qquad (5.32)$$

where ξ_i is the parameter (for example LPC coefficients) and N is
the number of parameters involved in the representation.

If cepstral coefficients are used, then by applying Parseval's
relation to Eq. (5.26) and using the fact that c_k = c_{-k}, we obtain

$$d_2^2 = \sum_{k=-\infty}^{\infty} (c_k - c'_k)^2 \qquad (5.33)$$

[...] rms log spectral measure defined by Eq. (5.4).
[...] to L terms to define a cepstral [...]

$$[u(L)]^2 = \sum_{k=-L}^{L} (c_k - c'_k)^2$$

Gray and Markel (1976) showed that if L=M (where M is the inverse
filter order), then the correlation between u(L) and d2 is already very
high (larger than 0.98). They also tested u(L) by taking L=2M and [...]
and showed that the correlation between u(L) and d2 increased with the
number of terms. This correlation will be verified in the next section.
In fact, u(L) can be interpreted as the rms distance between the
log spectra after each log spectrum has been cepstrally smoothed to
[...] coefficients.


5.1.3 Symmetrized Distances

Among the distance measures described above, it is easy to see that
the likelihood ratios, log likelihood ratios, and Itakura-Saito
distortion measures are not symmetric, thus violating one of the
principles of distance measures. However, it is easy to obtain a
symmetric measure from these measures.

Changing the roles of the reference spectrum and test spectrum in
Eq. (5.19) is equivalent to replacing D(θ) by -D(θ) in the integrand
(Gray and Markel, 1976), i.e.,

$$IS' = \int_{-\pi}^{\pi} \left[ e^{-D(\theta)} + D(\theta) - 1 \right] \frac{d\theta}{2\pi} \qquad (5.36)$$

Then we define the cosh measure as

$$\Omega = \frac{1}{2}(IS + IS') = \int_{-\pi}^{\pi} \left[ \cosh D(\theta) - 1 \right] \frac{d\theta}{2\pi} \qquad (5.37)$$

In order to relate Ω to a decibel scale, we can define ω in terms of the
inverse function used in the integral of Eq. (5.37), i.e.,

$$\cosh(\omega) - 1 = \Omega \qquad (5.38)$$

$$\omega = \ln\left[ 1 + \Omega + \sqrt{\Omega(2 + \Omega)} \right] \qquad (5.39)$$

Gray and Markel (1976) stress the fact that the cosh measure weighs
large differences in log spectra more heavily than the rms measure.
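The symmetrization can be illustrated numerically. In the sketch below the integrals are replaced by averages over a uniform frequency grid, and D is taken as the difference of natural-log power spectra; the function names and the flat test spectra are hypothetical.

```python
import math


def itakura_saito(power, power_prime):
    """Grid average of e^D - D - 1, with D = ln(power) - ln(power_prime)."""
    total = 0.0
    for p, q in zip(power, power_prime):
        d = math.log(p) - math.log(q)
        total += math.exp(d) - d - 1.0
    return total / len(power)


def cosh_measure(power, power_prime):
    """Half the sum of the two Itakura-Saito distortions (roles swapped)."""
    forward = itakura_saito(power, power_prime)
    backward = itakura_saito(power_prime, power)
    return 0.5 * (forward + backward)


def cosh_to_log_scale(big_omega):
    """Invert cosh(w) - 1 = Omega to recover w."""
    return math.log(1.0 + big_omega + math.sqrt(big_omega * (2.0 + big_omega)))


# Hypothetical flat power spectra differing by a factor of two.
p_ref = [2.0] * 8
p_test = [1.0] * 8
big_omega = cosh_measure(p_ref, p_test)
print(itakura_saito(p_ref, p_test), itakura_saito(p_test, p_ref), big_omega)
print(cosh_to_log_scale(big_omega))
```

For a constant log difference D = ln 2 the two Itakura-Saito values differ, their mean equals cosh(ln 2) - 1 = 0.25, and the inverse mapping returns ln 2, that is, the original log spectral difference.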


[...] related, thus simplifying their implementation. Considering D(θ) given
by Eq. (5.3), it can be shown that

$$\int_{-\pi}^{\pi} D(\theta)\, \frac{d\theta}{2\pi} = 2\ln(\sigma/\sigma') \qquad (5.40)$$

[...]

where δ/α is the likelihood ratio defined by Eq. (5.12), and σ and σ'
[...]

Hence, substituting Eq. (5.40) and (5.41) in Eq. (5.19), we obtain

$$IS = (\sigma/\sigma')^2 (\delta/\alpha) - 2\ln(\sigma/\sigma') - 1 \qquad (5.42)$$

which, for equal gains (σ = σ'), reduces to

$$\delta/\alpha = 1 + IS \qquad (5.43)$$

Similarly,

$$IS' = (\sigma'/\sigma)^2 (\delta'/\alpha') - 2\ln(\sigma'/\sigma) - 1 \qquad (5.44)$$

Hence, the Itakura-Saito distortion measure is closely related to
the likelihood ratio.

[...] (5.45)

and, substituting Eq. (5.44) into Eq. (5.19), we get (if σ = σ')

[...] (5.46)

Finally, substituting Eq. (5.42) and (5.44) into Eq. (5.37), we get

$$\Omega = \frac{1}{2}(\sigma/\sigma')^2 (\delta/\alpha) + \frac{1}{2}(\sigma'/\sigma)^2 (\delta'/\alpha') - 1 \qquad (5.47)$$

[...] (5.48)

5.1.5 Choice of Gain

As seen previously, the choice of gain is crucial for the rms log
spectral measure, the cepstral measure (for which the gain is
represented by c_0 in Eq. (3.54)), the Itakura-Saito measure, and the
[...] minimized if the gain constants are identical, [...]
choices of gain are made (Gray and Markel, 1976):

1 - all gains are equal to a constant, such [...]
all the log spectra have zero average value.

2 - [...] gains σ² = α/r_x(0) and σ'² = α'/r'_x(0), where {r_x(n)} and
{r'_x(n)} are the autocorrelation sequences for the data sequences
{x(n)} and {x'(n)}, respectively.

3 - in order to make the model spectral energy equal to that of the
original signals, we choose σ² = α and σ'² = α'. This enables us to
include the effects of intensity changes on evaluating spectral
differences in sounds (Gray and Markel, 1976).

4 - we may choose a gain that minimizes [...] respect to [...] and
setting [...]

Then the [...] et al. (1978).


- E1: it is defined as

E1 = [...]  (5.50)

where W(i) is the weighting function, and c > 1.

- E2: it is the average over the top 10% of the frame errors. This
average is based on the fact that subjective judgments are greatly
influenced by one or two large frame errors, which are not
emphasized by E1 (Viswanathan et al., 1978).

- E3: it is the sum of E1 and E2.

- E4: since the threshold of 10% used for E2 may be too rigid in
some cases, a weighted composite average was designed. It is obtained
[...]

E4 = E1 + w*E2  (5.51)

[...] and a is the "skewness" of the frame error distribution over the whole
utterance, defined as

$$a = \frac{1}{L} \sum_{l=1}^{L} \frac{[E(l) - E_1]^3}{\sigma_E^3} \qquad (5.53)$$

where σ_E is the standard deviation and L is the number of frames
analyzed.

From their study, Viswanathan et al. (1978) concluded that E4 was
[...] averaging technique.


[...] Collection


In order to test the various distance measures described above, the
data base of 52 normal voices and 30 pathologic voices was used. The
Itakura-Saito distance measure was not tested, since it is so closely
related to the likelihood ratio (see Eq. (5.42) and (5.43)). The
cepstral measure was implemented using the first, second, and third
choice of gains (see Section 5.1.3), whereas the cosh measure was
implemented using all four choices of gains. For the rms log spectral
measure, the gains were set equal to 1.

All in all, 13 distortion measures were implemented, i
log spectral measure, two likelihood ratios (&/m and I’/m
likelihood ratio (LLR), three cepstral measures (CEP1 to


Before relating the 13 measures to subjective judgments, we
first tested their efficiency in judging quality. In order to do so, we
selected three normal speakers with high-quality voices (two male and
one female voices); we then computed the various distances between the
two male voices and each of the 10 other high-quality male voices (for a
description of the reduced data base, see p. 44), and between the female
voice and each of the 13 other high-quality female voices. This
produced 32 results for each distortion and, since all the voices used
were very good, the distances were relatively small. We
the various distances between four high-quality voices (two


* expressed in decibels
At this stage, four distortion measures were hence dropped from the
set of distortion measures: the two likelihood ratios, the cepstral
measure computed with the second choice of gain (i.e. $\sigma^2 = \alpha / r_x(0)$ and
$\sigma'^2 = \alpha' / r'_x(0)$), and the cosh measure also with the second choice of
gain.


Finally, two correlation matrices were computed for the nine
distortion measures and the two sets of distances. The results are
presented in Tables 5.4 and 5.5. As expected, the rms log spectral
measure (D2) and the cepstral measure CEP1 were strongly correlated.
The fourth cosh measure, or COS4 (i.e. the cosh measure computed with
the fourth choice of gain) and CEP1 were also strongly correlated. Thus,
it seemed justified to drop D2 and COS4 from the set of distortion
measures, which left us with seven distortion measures.

5.2.2 Correlation with Subjective Judgments

Usually, distance measures are computed between the input and
output of a vocoder or a synthesizer, and the results are then
correlated to the subjective evaluation of the degradation of the speech
after passage through this vocoder or synthesizer. We used a different
approach. In order to test our distortion measures, we first applied
them to a high-quality reference voice (determined from the
listening test described in Chapter 4) and the 30 pathologic voices of
our data base, thus producing 30 measures for each distortion. The
conditions of analysis were the same as those described in Section
5.2.1, with only one difference: an overlap of [...]


We then correlated the 30 measures for each distortion to the
scores of the listening test. This was done for the judgments of
overall severity, hoarseness, breathiness, roughness, and vocal fry.
The high-quality (or reference) voice was considered free of any
[...]
voices was reflected by the results of the listening test, thus
justifying our method.

For this correlation analysis, the four averaging techniques
described in Section 5.1.6 were used successively, in order to determine
which one was best suited to each distortion measure (i.e. which
one yielded the highest coefficient of correlation between the various
qualities and each distortion measure).
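As an illustration of this correlation step, here is a minimal Python sketch of the coefficient of correlation between a set of distances and the corresponding listening test scores. The text does not state which coefficient was used; the Pearson product-moment coefficient is assumed here, and the function name is hypothetical.

```python
import math


def pearson_r(distances, scores):
    """Pearson correlation between distances and listening test scores."""
    if len(distances) != len(scores) or len(distances) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(distances)
    mean_d = sum(distances) / n
    mean_s = sum(scores) / n
    cov = sum((d - mean_d) * (s - mean_s) for d, s in zip(distances, scores))
    var_d = sum((d - mean_d) ** 2 for d in distances)
    var_s = sum((s - mean_s) ** 2 for s in scores)
    if var_d == 0.0 or var_s == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_d * var_s)
```

With this convention, a negative coefficient means that the distance decreases as the rated degradation increases. Repeating the computation for each averaging technique and keeping the one with the highest coefficient mirrors the selection described above.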


As shown in Table 5.5, the results were not good. None of
the coefficients of correlation were significant, except the
correlations between the degrees of overall severity, roughness and
vocal fry and COS3, and the degree of roughness and CEP3. Surprisingly,
these coefficients of correlation were all negative, which means that a
degradation of the voice quality corresponds to a decrease in the
distance measure. These results suggest that the third choice of gain
(i.e. $\sigma^2 = \alpha$ and $\sigma'^2 = \alpha'$) is the best one for the cepstral and cosh
measures. Contrary to the findings of previous studies (Klatt, 1982;
Barnwell et al., 1983), the Klatt distance measure did not perform well.

In order to check the validity of our method, we repeated the
experiment using a different high-quality voice as reference. The
distance (using any distortion measure) between the first and second
reference voices was very small. Using the second reference, the
results of the correlation analysis were very similar to those
previously obtained. This shows that, as long as a high-quality voice,
free of any distortion, is used as reference, the results of the
correlation analysis will stay the same, thus proving the validity of
our method. Figures 5.4 and 5.5 illustrate the difference between
high-quality and deviant voices. They show the LPC log spectra variations
between two consecutive frames for a high-quality voice and a pathologic
voice. For the high-quality voice, the variations
are very small, indicating a very stable spectrum. On the contrary, the
variations are much larger for a pathologic voice.

5.3  Euclidian  Distance  Measure 

As described in Section 5.1.2, parametric distances have been used
extensively in the past, for speech quality as well as speech
recognition applications. Since some of the acoustic measures described
in the previous chapters proved to be very efficient in predicting
various voice qualities, it seemed reasonable to use them to form a new
Euclidian distance measure.


5.3.1  Definition 

Six acoustic measures were retained as voice quality predictors in
Chapter 4 (see Table 4.2). If we want to compare
two speech signals having acoustic measures PPQ_j, APQ_j, SFR_j, PA_j, HNR_j,
and %JIT_j (j=1 for first speech signal, j=2 for second speech signal),
the proposed Euclidian distance measure has the form

$$d_{EUC} = \left[ K_1\,\Delta^2 PPQ_j + K_2\,\Delta^2 APQ_j + K_3\,\Delta^2 SFR_j + K_4\,\Delta^2 PA_j + K_5\,\Delta^2 HNR_j + K_6\,\Delta^2 \%JIT_j \right]^{1/2} \qquad (5.54)$$

where $\Delta$ indicates the difference between the values of each
measure computed for the two speech signals.

Figure 5.4: Spectrum inter-frame variability for normal voice

Figure 5.5: Spectrum inter-frame variability for pathologic voice
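The weighted Euclidian distance defined in this section can be sketched in a few lines of Python. The dictionary keys, the function name, and any numerical values passed to it are hypothetical; only the overall form (a square root of a weighted sum of squared differences over six acoustic measures) follows the definition in the text.

```python
import math

# Hypothetical keys for the six acoustic measures named in the text.
MEASURES = ("PPQ", "APQ", "SFR", "PA", "HNR", "JIT")


def euclidian_distance(voice_1, voice_2, weights):
    """Weighted Euclidian distance between two sets of acoustic measures.

    voice_1, voice_2: dicts mapping each measure name to its value.
    weights: dict mapping each measure name to its weight (K1 to K6),
             assumed non-negative so that the square root is defined.
    """
    total = 0.0
    for name in MEASURES:
        delta = voice_1[name] - voice_2[name]
        total += weights[name] * delta ** 2
    return math.sqrt(total)
```

Setting a weight to zero removes the corresponding measure from the distance, which is how a measure that does not help for a given voice quality can be left out.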


As seen in Chapter 4, not all the acoustic measures used in Eq. (5.54)
had the same efficiency in predicting voice quality, which
explains the presence of the weighting factors K1 to K6. In order to
find the optimal value for each of these coefficients, we used the 30
pairs of speech signals described in Section 5.2.2 (i.e. a high-quality
voice versus the 30 pathologic voices), and computed the coefficient of
correlation between the 30 Euclidian distances obtained and the results
of the listening test. This was repeated for each of the five voice
qualities previously described, i.e. overall severity, hoarseness,
breathiness, roughness, and vocal fry, and we
determined the constants K1 to K6 that yielded the highest coefficient
of correlation. The results are summarized in Table 5.6. For all the
voice qualities, except roughness, the weighting coefficients reflected
the importance of each acoustic parameter, as described in Table 4.4;
for roughness, it appeared that only the Pitch Amplitude (PA) was
important in the design of the Euclidian measure, while the other
significant acoustic parameters for roughness did not need to be
included to improve the coefficient of correlation. As can be seen in
Table 5.6, the coefficients of correlation were very high for all the
Table 5.7: Results of stepwise regression between distance measures
and listening test scores


Severity    Hoarseness    Breathiness    Roughness    Vocal Fry

factors 

Is:!" 

^Uj.COSl, 

R^.0.739 

S^.O.SSB 

dggc.CSPl 

r2.0.7B2 

BgUC'COSJ 

rJ.0,601 
voice qualities, except breathiness, which means that the Euclidian
distance proved to be very efficient.

The same procedure was repeated using another high-quality voice as
reference, and the results obtained were the same (i.e. the constants
found using the first reference also produced optimal coefficients of
correlation using the second reference). This proves that the new
distance does not depend on the reference chosen.
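The text does not say how the constants K1 to K6 were searched. As one possible illustration, the sketch below performs an exhaustive search over a small grid of candidate weights and keeps the combination whose distances correlate best with the listening test scores. The grid values, the function names, and the use of the Pearson coefficient are all assumptions.

```python
import itertools
import math


def pearson_r(xs, ys):
    """Pearson correlation; returns 0.0 when either sequence is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def weighted_distance(deltas, weights):
    """Square root of the weighted sum of squared differences."""
    return math.sqrt(sum(k * d * d for k, d in zip(weights, deltas)))


def search_weights(delta_rows, scores, candidates=(0.0, 0.5, 1.0)):
    """Try every weight combination on the grid; keep the best correlation.

    delta_rows: one tuple of measure differences per pathologic voice
                (reference voice minus pathologic voice).
    scores: listening test score for each pathologic voice.
    """
    best_r, best_weights = None, None
    n_measures = len(delta_rows[0])
    for weights in itertools.product(candidates, repeat=n_measures):
        distances = [weighted_distance(row, weights) for row in delta_rows]
        r = pearson_r(distances, scores)
        if best_r is None or r > best_r:
            best_r, best_weights = r, weights
    return best_r, best_weights
```

With six measures and three candidate values there are 729 combinations, so an exhaustive search is cheap for 30 voices. Running the same search with a second reference voice and comparing the selected weights corresponds to the check described above.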


It is likely that optimal distance measures can be obtained from a
combination of several distance measures. To verify this, we once
again used the 30 pairs of speech signals previously described; the 30
distances thus obtained with each of the seven distortion measures
described in Section 5.2.2 and the Euclidian distance measure introduced
in the previous section were then combined for a stepwise regression
analysis. These eight distortion measures were taken as the predictors
of the listening test scores, for each of the five voice qualities
considered (i.e. overall severity, hoarseness, breathiness, roughness,
and vocal fry). The results are summarized in Table 5.7. For each of
the voice qualities, the Euclidian distance measure was clearly the
dominant factor; COS1, COS3, and CEP3 also played a role. Except for
breathiness, the $R^2$ values obtained were relatively high.

5.4 Synthetic Voices

Some results on the human auditory system have been reported
previously (Carlson et al., 1979; Klatt, 1982), and it seemed
interesting to compare the behavior of our distance measures to these
results when prescribed distortions are applied to the synthetic vowel


* analysis  parameters  (e.g.  f 


5.4.1 Distance between Normal and

synthesized. Seventeen voices w
distorted synthetic voice) using
three formants, fundamental frequency) without any alteration, twelve

and third formant tracks, and four voices with an increasing amount

various distances between these voices and the original male voice. The
results are presented in Table 5.8.

As can be seen, the results depend on the distortion measure used.
[...]
degradation resulted in an increasing distance between the synthetic and
the natural voices. An increasing amount of noise in the formant or
fundamental frequency tracks did not always result in an increasing
distance between the synthetic and natural voices, except for the kPS
distortion measure. This might be due to the fact that the formants and
fundamental frequency may have occasionally been incorrectly detected
when analyzing the natural voice, and the noise addition may have in
fact compensated for such errors, thus making the resulting synthetic
voice closer to the natural one.

The Euclidian distance measure appeared to be very sensitive to a
degradation of the third formant track and of the fundamental frequency.

Table 5.8: Distances between male natural voice and distorted voice
after passage through formant synthesizer


5.4.2 Distance between Synthetic Vowels

For this experiment, the vowel /i/ (characterized by second and
third formants that are close to each other in frequency) was first
synthesized using the following parameters (this vowel will thereafter
be referred to as the non-distorted synthetic vowel):

[...]
F5 = 3750 Hz

[...] synthesized using the parameters [...]
(the distorted synthetic vowels):

[...]
B1+100% and B1-30%
B2+100% and B2-30%
B3+100% and B3-30%

Using the 8 distortion measures described in Sections 5.2 and 5.3,
the distances between these distorted synthetic vowels and the
non-distorted synthetic vowel were computed. The results are presented
in Table 5.9. Using any distortion measure, the
distances were much larger when formant frequencies were changed than
when changes in bandwidths were made. This is in agreement with the results
of Carlson et al. (1979), who described a similar experiment with the
vowel /ae/, but used perceptual judgments rather than distance measures.
As a result of their study, they concluded that "the auditory system
seems to be at least 20 times more sensitive to changes in formant
frequency than to changes in formant bandwidth" (p. [...]).

[...] a certain parameter
resulted in about the same distance between the non-distorted and the
distorted synthetic vowel, except when using the Euclidian distance
measure. This fact is also in agreement with the results of Carlson et
al. (1979). It also appears that for a given percentage change, the
distance was greater when the two formants F2 and F3 were changed
together.


In their studies, Carlson and Granstrom (1979) and Klatt (1982)
made a distinction between psychophysical and phonetic distance. For
the former, the listeners were asked to take into account any difference
between vowels whereas for the latter, they were asked to rate only
changes that tend to influence the vowel identity, and to disregard
changes associated with harshness, speaker identity, or transmission
channel. Using the synthetic vowel /a/ and a phonetic distance, Klatt
(1982) found that changes in F2 were the most important, followed by
changes in F1, B1, B2, F3, and B3. The distance measures described in
Table 5.9 appeared to be most sensitive to changes in F2, which is
consistent with our knowledge of the human auditory system. The most
[...]
measures. Then changes in F3 appeared to be the next most important
changes, whereas changes in B1 were the least important. The
discrepancies between Klatt's study (1982) and the present study are
most probably due to the fact that Klatt (1982) used the vowel /a/,
whereas we used the vowel /i/.

Our results suggest that most of the distortion measures we used
are a good representation of the human auditory system, except perhaps
the Euclidian distance measure; the latter does not seem to perform
well when synthetic voices are used.


In this chapter, several distortion measures previously described
in the literature were implemented. None of these measures correlated
well with the results of the listening test described in Chapter 4,
[...]
natural voices. A new Euclidian distance measure was designed, based on
the various acoustic parameters described in Chapter 4. The coefficient
of correlation between the distances obtained using this new distortion
measure and the listening test scores was high for each of the five
voice qualities studied (i.e. overall severity, hoarseness, breathiness,
roughness, and vocal fry).


When using synthetic voices, the results were different. The
results obtained using the various distortion measures described at the
[...]
studies. For the human auditory system, changes in formant frequencies
are much more important than changes in formant bandwidths, which was
reflected by the various distortion measures. Also,
changes in F2 were the most important with respect to variations in the
distances. The one measure that did not perform well in this experiment
was the Euclidian distance measure. This is probably due to the fact
that the acoustic parameters on which this measure is based are very
different for synthetic voices than for natural voices (e.g. the
synthetic voices we used had almost no jitter and shimmer, which is not
the case of natural voices). Hence, if coders or synthesizers are to be
tested, this Euclidian distance measure is more a measure of naturalness
than of quality.

Except for the Euclidian measure, the results obtained with the
other distortion measures appeared to agree with our knowledge of the
speech perception process. However, the latter is still insufficient,
and more needs to be known before measures that replicate this process
can be implemented.


CONCLUSIONS 


Two major accomplishments were achieved in this study. First, we
[...]
voice qualities, and second, we created a distortion measure that can
adequately compute the distance between two natural voices. We also
identified several distortion measures that can replicate some aspects
of the human auditory process.


Most of the acoustic parameters we used in this study were
extracted from the residue signal, obtained by inverse filtering the
[...]
/i/. The originality of this study consisted in the number of acoustic
parameters considered (12) and in the wide range of voice qualities
found in our voice data base (that consisted of 52 normal voices and 30
pathologic voices). The parameters that did not show any statistical
[...]
We also studied the capability of [...] between genders, and some of
them [...] encouraging results. One of the problems we faced in this
study [...]
find the optimal conditions of analysis to extract the various
parameters. This problem has often been overlooked in previous studies.


[...] cycles needed to be analyzed to correctly compute the perturbation
measures, and that the inverse filter coefficients needed to be updated
every frame (i.e. 25.6 ms). Using a general purpose formant
synthesizer, we also studied the relations between some of the acoustic
[...]
Amplitude Perturbation Quotient (APQ), and Harmonics-to-Noise Ratio
(HNR) were closely related, at least for synthetic voices. The effects
of various glottal sources on synthetic voices were also studied, and we
[...]
qualities such as breathiness. The variations of the Fant model
parameters did not have much effect on the acoustic measures.

An essential part of this study was the pitch detection scheme,
since so many acoustic measures depended on the pitch period. We used
the location of the first peak of the residue signal autocorrelation
(not counting the peak at the origin) as an indicator of the pitch
period. We then compared our results to the results obtained using the
differentiated electroglottograph (DEGG), and found that our pitch
detector was very reliable. The DEGG was also used to compute some of
the perturbation measures, but the lack of clear periodicity of this
signal in some pathologic cases made this method unreliable.
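The pitch detection principle summarized above (first peak of the residue autocorrelation, excluding the origin) can be sketched as follows. This is a simplified illustration rather than the detector used in the study: the lag search range, the peak threshold, and the normalization of the peak amplitude by the value at lag zero are assumptions, and the function names are hypothetical.

```python
def autocorrelation(signal, max_lag):
    """Autocorrelation of `signal` for lags 0..max_lag (no normalization)."""
    n = len(signal)
    return [
        sum(signal[i] * signal[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def first_peak(residue, min_lag, max_lag, threshold=0.3):
    """Return (lag, normalized amplitude) of the first autocorrelation peak.

    The search starts at min_lag so that the peak at the origin is skipped.
    A lag counts as a peak if it is a local maximum whose value exceeds
    `threshold` times the value at lag 0. Returns None if no peak is found.
    """
    if not 1 <= min_lag <= max_lag < len(residue) - 1:
        raise ValueError("lag range must fit inside the signal")
    r = autocorrelation(residue, max_lag + 1)
    if r[0] <= 0.0:
        return None  # silent frame: no energy, no pitch estimate
    for lag in range(min_lag, max_lag + 1):
        is_local_max = r[lag] > r[lag - 1] and r[lag] >= r[lag + 1]
        if is_local_max and r[lag] > threshold * r[0]:
            return lag, r[lag] / r[0]
    return None
```

At a 10 kHz sampling frequency, a returned lag of 80 samples corresponds to a pitch period of 8 ms, i.e. a fundamental frequency of 125 Hz. The normalized amplitude plays the role of a Pitch Amplitude-like quantity in this sketch.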


The next step was then to correlate the results obtained from the
acoustic measures to the results of a subjective listening test. At
this point, only the six most efficient measures had been retained (i.e.
PPQ, APQ, Spectral Flatness of the Residue signal, Pitch Amplitude, HNR,
and percent jitter). We implemented a listening test where seven judges
were asked to rate the 30 pathologic [...]
three females) [...]
voices on a 1 to 7 scale for the five following qualities: overall
severity, hoarseness, breathiness, roughness, and vocal fry.


The results concerning the normal voices were not conclusive. The
judges appeared to rely on different sets of criteria, and no general
conclusion could be reached. This was not the case, however, for
pathologic voices. For all the voice qualities studied, except [...],
the Pitch Amplitude (PA), which represents the amplitude of
the first peak of the autocorrelation of the residue signal, was the
main predictor. [...] predictors. The coefficients of correlation were
relatively high, suggesting that the set of acoustic parameters used was
adequate. We were hence able to characterize each voice quality by a
given set of parameters (e.g. [...] perturbation measures
and a high HNR, whereas [...] severely deviant voice
is characterized by a [...] PA and a low HNR). This study clearly shows
that the quality of a voice depends more on the characteristics of the
glottal source than on spectral characteristics.

In judging quality, it is also important to be able to compare two
voices. This is the goal of distortion measures, often called distance
measures, even though they do not respect the laws of distance metrics.
We implemented several of these distortion measures, most of which had
previously been studied for speech recognition applications rather than
for speech quality evaluation. Using these distortion measures, we
computed two sets of distances, one composed of distances between
high-quality voices, and the other of distances between high-quality and
very deviant voices. The distortion measures that showed statistically
different results between these two sets of distances were kept for the
remainder of the study. This left us with seven uncorrelated distortion
measures. We then considered the highest-quality voice in our data
base, and computed the distance between this voice and the 30 pathologic
voices. This approach was new, since very few studies have computed
distance measures between two natural voices (most of the time,
distortion measures are used to compute the distance between a
synthesized or a coded voice and a natural voice). The results were
then correlated to the scores of the listening test implemented
previously. The results were very poor, indicating that these
distortion measures do not adequately compute [...]. Since all the
distortion [...] spectral measures, and we previously showed that
spectral characteristics are not the main correlate of voice quality,
[...]

Using the knowledge gained from the acoustic parameters
described previously, we went on to design a new Euclidian distance
measure based on these parameters. The distances computed using this
Euclidian measure were highly correlated to the listening test scores,
for the five voice qualities considered.

When synthetic voices were used, we found that this Euclidian
distance measure did not give very good results. This is due to the
fact that when non-natural voices are used, this measure is more a
measure of naturalness than of quality, since the acoustic parameters on
which it is based are very different for synthetic voices than
for natural voices. The synthesis was achieved using a fixed excitation
source, which explains the poor results obtained with the Euclidian
measure, which is based on source parameters. On the other hand, the
other distortion measures seemed to agree with the few notions we have
[...]
whereas bandwidth variations are [...]


6.2  Future  Research 

[...] adequate set of acoustic measures that can reliably predict voice
quality. We started with fourteen measures, but many more have been
reported in the literature (such as the Long-Time Average Spectrum or
LTAS), and many more are in development. One of the limitations of our
study was the use of the vowel /i/ recorded during about 2 s, whereas
some measures such as the LTAS require longer recordings. The use of
[...]
levels of processing such as voiced/unvoiced/mixed/silence detection.
The electroglottograph (EGG) is being studied to develop algorithms that
[...]
can predict quality. It is likely that the addition of new efficient
[...]
study will further enhance our capability to predict voice quality,
especially for normal voices.


[...] used a 10 kHz sampling frequency, but the effect of higher sampling
frequencies on the computation of perturbation measures such as jitter
and shimmer needs to be studied in depth. The importance of the
sampling frequency is still an open subject.

At the outset of our study is the problem of gender recognition.
Some of the acoustic measures we implemented clearly showed an ability
to discriminate between male and female voices. However, they need to
be tested on a larger data base, and using other phonemes than the vowel
/i/. Similarly, repeated measurements for the same voice must be taken,
in order to study the stability of the various measures with time, and
the possibility of using them for speaker recognition applications.

Finally, one of the main objects of any future research should be a
better comprehension of the human perception system. This is essential
to the development of new distortion measures. Measures such as the
Klatt distance measure are based on our knowledge of the human
perception system, but the latter is so scarce that we are still looking
for an "ideal" measure. The Klatt measure, for example, showed good
results for synthetic voices, but not for natural voices. It is likely
that a better understanding of the way
we perceive speech will help us to develop measures that can more closely
match this perception process. It is clear that to achieve real
progress in the future, speech scientists involved in the field of
speech quality must work closely with speech pathologists and
otolaryngologists.